Preface



Since 1991, the European Conference on Symbolic and Quantitative Approaches
to Reasoning with Uncertainty (ECSQARU) has been a major forum for
advances in the theory and practice of reasoning and decision making under
uncertainty. The scope of ECSQARU is wide and includes, but is not limited to,
fundamental issues, representation, inference, learning, and decision making in
qualitative and numeric paradigms. The first ECSQARU conference (1991) was
held in Marseilles, and since then it has been held in Granada (1993), Fribourg
(1995), Bonn (1997), London (1999) and Toulouse (2001).

This volume contains the papers that were presented at ECSQARU 2003,
held at Aalborg University, Denmark, from July 2 to July 5, 2003. The papers
went through a rigorous reviewing process: three program committee members
reviewed each paper, monitored by an area chair, who made a final recommendation
to the program co-chairs. In addition to the regular presentations, the
technical program for ECSQARU 2003 also included talks by three distinguished
invited speakers: Didier Dubois, Philippe Smets and Jeroen Vermunt. Didier
Dubois and Jeroen Vermunt also contributed to this volume with papers on the
subjects of their talks.

As a continuation of tradition, an affiliated workshop was held prior to the
conference itself. The workshop was entitled “Uncertainty, Incompleteness,
Imprecision and Conflict in Multiple Data Sources” and it was organized by Weiru
Liu (UK), Laurence Cholvy (France), Salem Benferhat (France) and Anthony
Hunter (UK). ECSQARU 2003 also included a special software demo session
which was intended to promote software development activities in the area of
tools supporting symbolic and quantitative approaches to reasoning with
uncertainty. This volume contains two special papers that describe two of the
presentations.

Finally, we would like to thank the area chairs, the members of the program 
committee and all the additional referees for their work. We would also like to 
thank Regitze Larsen and Lene Mogensen for their help with the organization 
of the conference. 
Abstract. This paper is a survey of qualitative decision theory focused on the
available decision rules under uncertainty, and their properties. It is pointed out
that two main approaches exist according to whether degrees of uncertainty and
degrees of utility are commensurate (that is, belong to a unique scale) or not.
Savage-like axiomatics for both approaches are surveyed. In such a framework,
acts are functions from states to results, and decision rules are used to characterize
a preference relation on acts. It is shown that the emerging uncertainty
theory in qualitative settings is possibility theory rather than probability theory.
However, these approaches lead to criteria that are either weakly decisive due to
incomparabilities, or too adventurous because they focus on the most plausible
states, or lacking in discrimination because of the coarseness of the value
scale. Some new results overcoming these defects are reviewed. Interestingly,
they lead to genuine qualitative counterparts to expected utility.



1 Introduction 

Traditionally, decision making under uncertainty (DMU) relies on a probabilistic 
framework. The ranking of acts is done according to the expected utility of the conse- 
quences of these acts. This proposal was made by economists in the 1950's, and justi- 
fied on an axiomatic basis by Savage [36] and colleagues. More recently, in Artificial 
Intelligence, this setting has been applied to problems of planning under uncertainty, 
and is at the root of the influence diagram methodology for multiple stage decision 
problems. However, in parallel to these developments, Artificial Intelligence has
witnessed the emergence of a new decision paradigm called qualitative decision the- 
ory, where the rationale for choosing among decisions no longer relies on probability 
theory nor numerical utility functions [7]. Motivations for this new proposal are two- 
fold. On the one hand, there exists a tradition of symbolic processing of information 
in Artificial Intelligence, and it is not surprising that this tradition should try and stick 
to symbolic approaches when dealing with decision problems. Formulating decision 
problems in a symbolic setting may be more compatible with a declarative expression 
of uncertainty and preferences in some logic-based language [3, 39]. On the other
hand, the emergence of new technologies like information systems or autonomous 
robots has generated many new decision problems involving intelligent agents [4] 
where quantifying preference and belief is neither imperative nor easy. 

There is a need for qualitative decision rules. However there is no real agreement 
on what "qualitative" means. Some authors assume incomplete knowledge about 
classical additive utility models, whereby the utility function is specified via symbolic 
constraints ([31, 1] for instance). Others use sets of integers and the like to describe 
rough probabilities or utilities [38]. Lehmann [33] injects some qualitative concepts 
of negligibility in the classical expected utility framework. However some approaches 
are genuinely qualitative in the sense that they do not involve any form of quantifica- 
tion. We take it for granted that a qualitative decision theory is one that does not re- 
sort to the full expressive power of numbers for the modeling of uncertainty, nor for 
the representation of utility. 



2 Quantitative and Qualitative Decision Rules 

A decision problem is often cast in the following framework [36]: consider a set $S$ of
states (of the world) and a set $X$ of potential consequences of decisions. States encode
possible situations, states of affairs, etc. An act is viewed as a mapping $f$ from the
state space to the consequence set, namely, in each state $s \in S$, an act $f$ produces a
well-defined result $f(s) \in X$. The decision maker must choose acts without knowing
precisely what the current state of the world is. The consequences of an act can
often be ranked in terms of their relative appeal: some consequences are judged to be
better than others. This is often modeled by means of a numerical utility function $u$
which assigns to each consequence $x \in X$ a utility value $u(x) \in \mathbb{R}$. Besides, there are
two usual approaches to modeling the lack of knowledge of the decision maker about 
the state. The most widely found assumption is that there is a probability distribution 
p on S. It is either obtained from statistics (this is called decision under risk) or it is a 
subjective probability supplied by the agent via suitable elicitation methods. Then the 
most usual decision rule is based on the expected utility criterion: 

$EU(f) = \sum_{s \in S} p(s)\, u(f(s)).$ (1)

An act f is strictly preferred to act g if and only if EU(f) > EU(g). This is by far the 
most commonly used criterion. It makes sense especially for repeated decisions 
whose results accumulate. It also clearly presupposes subjective notions like belief 
and preference to be precisely quantified. It means that, in the expected utility model, 
the way in which the preference on consequences is numerically encoded will affect 
the induced preference relation on acts. The model exploits some extra information 
not contained solely in a preference relation on X, namely, the absolute order of mag- 
nitude of utility grades. Moreover the same numerical scale is used for utilities and 
degrees of probability. This is based on the idea that a lottery (involving uncertainty) 
can be compared to a sure gain or a sure loss (involving utility only) in terms of
preference.
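The dependence of the expected utility ranking on the numerical encoding of utilities can be illustrated with a minimal sketch. The states, consequences, probabilities and utility values below are hypothetical and chosen only for illustration; the two utility tables respect the same ordering of consequences.

```python
def expected_utility(act, p, u):
    """Return EU(f) = sum over states s of p(s) * u(f(s))."""
    return sum(p[s] * u[act[s]] for s in p)

# Hypothetical data: two states and three consequences (not from the paper).
p = {"rain": 0.25, "sun": 0.75}
take_umbrella = {"rain": "dry", "sun": "dry"}
no_umbrella = {"rain": "wet", "sun": "happy"}

# Two numerical encodings of the same ordering wet < dry < happy.
u1 = {"wet": 0.0, "dry": 4.0, "happy": 8.0}
u2 = {"wet": 0.0, "dry": 7.0, "happy": 8.0}

for name, u in (("u1", u1), ("u2", u2)):
    eu_take = expected_utility(take_umbrella, p, u)
    eu_none = expected_utility(no_umbrella, p, u)
    best = "take_umbrella" if eu_take > eu_none else "no_umbrella"
    print(name, eu_take, eu_none, best)
```

Under `u1` the act `no_umbrella` is ranked first (6.0 against 4.0), while under `u2` the ranking is reversed (6.0 against 7.0), although the preference ordering on consequences is unchanged.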

Another proposal is the Wald criterion. It applies when no information about the
current state is available, and it ranks acts according to their worst consequences:

$W^-(f) = \min_{s \in S} u(f(s)).$ (2)

This is a well-known pessimistic criterion. An optimistic counterpart $W^+(f)$ of
$W^-(f)$ is obtained by turning minimum into maximum. These criteria do not need
numerical utility values. Only a total ordering on consequences is needed. No knowledge
about the state of the world is necessary. Clearly the criterion $W^-$ has the major
defect of being extremely pessimistic. In practice, it is never used for this reason.
Hurwicz has proposed to use a weighted average of $W^-(f)$ and its optimistic counterpart,
where the weight bearing on $W^-(f)$ is viewed as a degree of pessimism of the
decision maker. Other decision rules have been proposed, especially some that generalize
both $EU(f)$ and $W^-(f)$, based on the Choquet integral [29]. However, all these extensions
again require the quantification of preferences and/or uncertainty.
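The Wald criterion, its optimistic counterpart and the Hurwicz average can be sketched as follows. The acts, the utility table and the parameter name `alpha` (the weight bearing on the pessimistic value) are assumptions made for the example.

```python
def wald(act, u):
    """Pessimistic value W^-(f): utility of the worst consequence of the act."""
    return min(u[x] for x in act.values())

def optimistic(act, u):
    """Optimistic value W^+(f): utility of the best consequence of the act."""
    return max(u[x] for x in act.values())

def hurwicz(act, u, alpha):
    """Weighted average of W^- and W^+, with alpha the degree of pessimism."""
    return alpha * wald(act, u) + (1 - alpha) * optimistic(act, u)

# Hypothetical utility table and acts over three states.
u = {"bad": 0, "fair": 2, "good": 4}
safe = {"s1": "fair", "s2": "fair", "s3": "fair"}
risky = {"s1": "bad", "s2": "good", "s3": "good"}

print(wald(safe, u), wald(risky, u))                    # 2 0
print(hurwicz(safe, u, 0.75), hurwicz(risky, u, 0.75))  # 2.0 1.0
print(hurwicz(safe, u, 0.25), hurwicz(risky, u, 0.25))  # 2.0 3.0
```

Note that `wald` and `optimistic` only use the ordering of the utility values, whereas `hurwicz` needs their numerical magnitude, in line with the remark that such extensions require quantification.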

Qualitative variants of the Wald criterion nevertheless exist. Boutilier [3] is inspired
by preferential inference of nonmonotonic reasoning whereby a proposition p
entails another one q by default if q is true in the most normal situations where p is
true. He assumes that states of nature are ordered in terms of their relative plausibilities
using a complete preordering $\geq$ on $S$. He proposes to choose decisions on the
basis of the most plausible states of nature in accordance with additional information,
neglecting other states. If the additional information is that $s \in A$, a subset of states,
and if $A^*$ is the set of maximal elements in $A$ according to the plausibility ordering $\geq$,
then the criterion is defined by

$W^-_{\geq}(f) = \min_{s \in A^*} u(f(s)).$ (3)

This approach has been axiomatized by Brafman and Tennenholtz [5] in terms of 
conditional policies (rather than acts). Lehmann [33] axiomatizes a refinement of the 
Wald criterion whereby ties between equivalent worst states are broken by consider- 
ing their respective likelihood. This decision rule takes the form of an expected utility 
criterion with qualitative (infinitesimal) utility levels. An axiomatization is carried out 
in the Von Neumann-Morgenstern style. 

Another refinement of the Wald criterion, the possibilistic qualitative criterion [18, 17],
is based on a utility function $u$ on $X$ and a possibility distribution [19, 42] $\pi$ on $S$, both
mapping to the same totally ordered scale $L$, with top 1 and bottom 0. The ordinal
value $\pi(s)$ represents the relative plausibility of state $s$. A pessimistic criterion $W^-_\pi(f)$
is proposed of the form:

$W^-_\pi(f) = \min_{s \in S} \max(n(\pi(s)), u(f(s)))$ (4)

Here, $L$ is equipped with its involutive order-reversing map $n$; in particular $n(1) = 0$,
$n(0) = 1$. So, $n(\pi(s))$ represents the degree of potential surprise caused by realizing
that the state of the world is $s$. In particular, $n(\pi(s)) = 1$ for impossible states. The
value of $W^-_\pi(f)$ is small as soon as there exists a highly plausible state ($n(\pi(s)) = 0$)
with low utility value. This criterion is actually a prioritized extension of the Wald
criterion $W^-(f)$. The latter is recovered if $\pi(s) = 1$ for all $s \in S$. The decisions are
again made according to the merits of acts in their worst consequences, now restricted
to the most plausible states. But the set of most plausible states
($S^* = \{s, \pi(s) > n(W^-_\pi(f))\}$) now depends on the act itself. It is defined by the compromise between
belief and utility expressed in the min-max expression. The possibilistic qualitative
criterion presupposes that degrees of utility $u(f(s))$ and possibility $\pi(s)$ share the same
scale and can be compared.
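A minimal sketch of the pessimistic criterion (4) follows. It assumes that the scale $L$ is encoded by a few levels in [0, 1] and that the order-reversing map is $n(x) = 1 - x$; both are illustrative choices, as are the acts and the possibility distribution.

```python
def n(level):
    """Order-reversing map on the scale; n(x) = 1 - x is an assumed encoding."""
    return 1.0 - level

def pessimistic(act, pi, u):
    """W^-_pi(f) = min over states of max(n(pi(s)), u(f(s)))."""
    return min(max(n(pi[s]), u[act[s]]) for s in pi)

# Hypothetical data on the common scale {0, 0.5, 1}.
u = {"bad": 0.0, "fair": 0.5, "good": 1.0}
pi = {"s1": 1.0, "s2": 0.5, "s3": 0.0}          # s3 is impossible
f = {"s1": "good", "s2": "fair", "s3": "bad"}
g = {"s1": "bad", "s2": "good", "s3": "good"}

print(pessimistic(f, pi, u))   # 0.5: the bad result in s3 is ignored
print(pessimistic(g, pi, u))   # 0.0: bad result in the fully plausible s1

# With pi(s) = 1 everywhere, the Wald criterion is recovered.
ignorance = {s: 1.0 for s in pi}
print(pessimistic(f, ignorance, u) == min(u[x] for x in f.values()))  # True
```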

The optimistic counterpart of this criterion is [18]: 

$W^+_\pi(f) = \max_{s \in S} \min(\pi(s), u(f(s)))$ (5)

This optimistic criterion was first proposed by Yager [41] and the pessimistic
criterion by Whalen [40]. These optimistic and pessimistic possibilistic criteria are 
actually particular cases of a more general criterion based on the Sugeno integral (a 
qualitative counterpart of the Choquet integral [29]) one expression of which can be 
written as follows: 



$S_\gamma(f) = \max_{A \subseteq S} \min(\gamma(A), \min_{s \in A} u(f(s))),$ (6)

where $\gamma(A)$ is the degree of confidence in event $A$, and $\gamma$ is a set-function which reflects
the decision-maker's attitude toward uncertainty. If the set of states is rearranged
in decreasing order via $f$ in such a way that $u(f(s_1)) \geq \dots \geq u(f(s_n))$, then denoting
$A_i = \{s_1, \dots, s_i\}$, it turns out that $S_\gamma(f)$ is the median of the set
$\{u(f(s_1)), \dots, u(f(s_n))\} \cup \{\gamma(A_1), \dots, \gamma(A_{n-1})\}$.
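The Sugeno integral and its median form can be compared on a small case. The sketch below is a brute-force illustration, not an efficient implementation; it assumes the encoding $n(x) = 1 - x$, takes for $\gamma$ either the possibility measure $\Pi(A) = \max_{s \in A} \pi(s)$ or the necessity measure $N(A) = n(\Pi(A^c))$, and works directly on a hypothetical table `util` giving $u(f(s))$ for each state.

```python
from itertools import combinations
from statistics import median

pi = {"s1": 1.0, "s2": 0.5, "s3": 0.25}       # hypothetical possibility degrees
util = {"s1": 0.25, "s2": 0.75, "s3": 1.0}    # hypothetical values u(f(s))

def possibility(event):
    """Pi(A): plausibility of the most plausible state in A (0 for the empty set)."""
    return max((pi[s] for s in event), default=0.0)

def necessity(event):
    """N(A) = 1 - Pi(complement of A)."""
    return 1.0 - possibility(frozenset(pi) - event)

def sugeno(util, gamma):
    """Expression (6): max over non-empty events A of min(gamma(A), min_A u)."""
    states = list(util)
    return max(
        min(gamma(frozenset(event)), min(util[s] for s in event))
        for r in range(1, len(states) + 1)
        for event in combinations(states, r)
    )

def sugeno_median(util, gamma):
    """Median of the n utility values and the n - 1 confidence degrees gamma(A_i)."""
    ordered = sorted(util, key=util.get, reverse=True)
    prefixes = [gamma(frozenset(ordered[:i])) for i in range(1, len(ordered))]
    return median(list(util.values()) + prefixes)

print(sugeno(util, possibility), sugeno_median(util, possibility))  # 0.5 0.5
print(sugeno(util, necessity), sugeno_median(util, necessity))      # 0.25 0.25
```

On this example the value obtained with `possibility` coincides with the optimistic criterion (5), and the value obtained with `necessity` coincides with the pessimistic criterion (4).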

The restriction to the most plausible states, as implemented by the above criteria, 
makes them more realistic than the Wald criterion, but they still yield coarse rankings 
of acts (there are no more classes than the number of elements in the finite scale L). 
The above qualitative criteria do not use all the available information to discriminate 
among acts. Especially an act f can be ranked equally with another act g even if f is at 
least as good as g in all states and better in some states (including most plausible 
ones). This defect cannot be found with the expected utility model. Yet it sounds 
natural that the following constraint be respected by decision rules : 

Pareto-dominance: if $\forall s$, $u(f(s)) \geq u(g(s))$, and $\exists s$, $u(f(s)) > u(g(s))$, then $f$ is
preferred to $g$.

The lack of discrimination of the Wald criterion was actually addressed a long time
ago by Cohen and Jaffray [6] who improved it by comparing acts on the basis of their
worst consequences of distinct merits, i.e. one considers only the set
$D(f, g) = \{s, u(f(s)) \neq u(g(s))\}$ when performing a minimization. Denoting by $f >_D g$ the strict
preference between acts,

$f >_D g$ iff $\min_{s \in D(f, g)} u(f(s)) > \min_{s \in D(f, g)} u(g(s))$ (7)

This refined rule always rates an act f better than another act g whenever f Pareto- 
dominates g. However, only a partial ordering of acts is then obtained. This last deci- 
sion rule is actually no longer based on a preference functional. It has been independ- 
ently proposed and used in fuzzy constraint satisfaction problems under the name 
“discrimin ordering” (see references in [16]). 
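The refinement (7) can be sketched as a pairwise comparison. The function name `discrimin_prefers` and the utility profiles are hypothetical; each act is represented directly by the utilities of its consequences in each state.

```python
def discrimin_prefers(fu, gu):
    """Strict preference of rule (7): compare worst utilities on D(f, g) only."""
    differing = [s for s in fu if fu[s] != gu[s]]   # the set D(f, g)
    if not differing:
        return False                                # identical utility profiles
    return min(fu[s] for s in differing) > min(gu[s] for s in differing)

# Hypothetical utility profiles u(f(s)), u(g(s)), u(h(s)) over three states.
fu = {"s1": 0, "s2": 2, "s3": 1}
gu = {"s1": 0, "s2": 1, "s3": 1}
hu = {"s1": 2, "s2": 0, "s3": 1}

print(min(fu.values()) == min(gu.values()))   # True: Wald cannot separate f and g
print(discrimin_prefers(fu, gu))              # True: f Pareto-dominates g
print(discrimin_prefers(fu, hu), discrimin_prefers(hu, fu))  # False False
```

The last line shows a pair of acts left incomparable by the rule, which is why only a partial ordering is obtained.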



3 An Ordinal Decision Rule without Commensurateness 

Many of the above decision rules presuppose that utility functions and uncertainty 
functions share the same range, so that it makes sense to write $\min(\pi(s), u(f(s)))$, for
instance. In contrast, one may look for a natural decision rule that computes a prefer- 
ence relation on acts from a purely symbolic perspective, no longer assuming that 
utility and partial belief are commensurate, that is, share the same totally ordered 
scale [10]. The decision-maker only supplies a confidence relation between events 
and a preference ordering on consequences. 

In the most realistic model, a confidence relation on the set of events is an irreflexive
and transitive relation $>_L$ on $2^S$, and a non-trivial one ($S >_L \emptyset$), faithful to deductive
inference. $A >_L B$ means that event $A$ is more likely than $B$. Moreover, if $A \subseteq B$
and $C \subseteq D$, then $A >_L D$ should imply $B >_L C$ [27]. This so-called orderly property
implies that if $A \subseteq B$, then $A$ cannot be more likely than $B$. Define the weak likelihood
relation $\geq_L$ induced from $>_L$ via complementation, and the indifference relation
$\sim_L$ as usual: $A \geq_L B$ iff not ($B >_L A$); and $A \sim_L B$ iff $A \geq_L B$ and $B \geq_L A$.

The preference relation on the set of consequences $X$ is supposed to be a complete
preordering. Namely, $\geq_P$ is a reflexive and transitive relation, and completeness
means $x \geq_P y$ or $y \geq_P x$. So, $\forall x, y \in X$, $x \geq_P y$ means that consequence $x$ is not worse
than $y$. The induced strict preference relation is derived as usual: $x >_P y$ if and only if
$x \geq_P y$ and not $y \geq_P x$. It is assumed that $X$ has at least two elements $x$ and $y$ s.t.
$x >_P y$. The assumptions pertaining to $\geq_P$ are natural in the scope of numerical representations
of utility, however we do not require that the weak likelihood relation be a complete
preordering.

If the likelihood relation on events and the preference relation on consequences are
not comparable, a natural way of lifting the pair ($>_L$, $\geq_P$) to $X^S$ is as follows: an act $f$
is more promising than an act $g$ if and only if the event formed by the disjunction of
states in which $f$ gives better results than $g$, is more likely than the event formed by
the disjunction of states in which $g$ gives results better than $f$. A state $s$ is more promising
for act $f$ than for act $g$ if and only if $f(s) >_P g(s)$. Let $[f >_P g]$ be an event made
of all states where $f$ outperforms $g$, that is $[f >_P g] = \{s \in S, f(s) >_P g(s)\}$. Accordingly,
we define the strict preference between acts ($\succ$) as follows:

$f \succ g$ if and only if $[f >_P g] >_L [g >_P f]$.

This is the Likely Dominance Rule [10]. $f \succeq g$ (complete preordering) and $f \sim g$
(incomparability or indifference) stand for not($g \succ f$), and $f \succeq g$ and $g \succeq f$, respectively.
This rule is the first one that comes to mind when information is only available under
the form of an ordering of events and an ordering of consequences and when the
preference and uncertainty scales are not comparable. Events are only compared to
events, and consequences to consequences. The properties of the relations $\succeq$, $\sim$, and
$\succ$ on $X^S$ will depend on the properties of $\geq_L$ with respect to Boolean connectives.
Note that if $\geq_L$ is a comparative probability ordering then the strict preference relation
$\succ$ in $X^S$ is not necessarily transitive.



Example 1 : 

A very classical and simple example of undesirable lack of transitivity is when
$S = \{s_1, s_2, s_3\}$ and $X = \{x_1, x_2, x_3\}$ with $x_1 >_P x_2 >_P x_3$, and the comparative probability
ordering is generated by a uniform probability on $S$. Suppose three acts $f$, $g$, $h$ with
$f(s_1) = x_1 >_P f(s_2) = x_2 >_P f(s_3) = x_3$, $g(s_3) = x_1 >_P g(s_1) = x_2 >_P g(s_2) = x_3$,
$h(s_2) = x_1 >_P h(s_3) = x_2 >_P h(s_1) = x_3$. Then $[f >_P g] = \{s_1, s_2\}$; $[g >_P f] = \{s_3\}$;
$[g >_P h] = \{s_1, s_3\}$; $[h >_P g] = \{s_2\}$; $[f >_P h] = \{s_1\}$; $[h >_P f] = \{s_2, s_3\}$.

The likely dominance rule yields $f \succ g$, $g \succ h$, $h \succ f$. Note that the presence of
this cycle does not depend on figures of utility that could be attached to consequences
insofar as the ordering of utility values is respected for each state. The undesirable
cycle remains if probabilities $p(s_1) > p(s_2) > p(s_3)$ are attached to states, and
the degrees of probability remain close to each other (so that $p(s_2) + p(s_3) > p(s_1)$).
In contrast the ranking of acts induced by expected utility completely depends on the
choice of utility values, even if we keep the constraint $u(x_1) > u(x_2) > u(x_3)$. The
reader can check that, by symmetry, any of the three linear orders $f \succ g \succ h$, $g \succ h \succ f$,
$h \succ f \succ g$ can be obtained by suitably quantifying the utility values of states without
changing their preference ranking.
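The cycle of Example 1 can be reproduced with a short sketch of the likely dominance rule. The likelihood relation is passed as a function `more_likely(A, B)`; the integer table `RANK` only encodes the ordering of consequences, and the non-uniform probability values are hypothetical figures chosen so that $p(s_2) + p(s_3) > p(s_1)$.

```python
RANK = {"x1": 3, "x2": 2, "x3": 1}   # only the ordering x1 > x2 > x3 matters

f = {"s1": "x1", "s2": "x2", "s3": "x3"}
g = {"s3": "x1", "s1": "x2", "s2": "x3"}
h = {"s2": "x1", "s3": "x2", "s1": "x3"}

def better_states(a, b):
    """The event [a >_P b]: states where act a gives a strictly better result."""
    return frozenset(s for s in a if RANK[a[s]] > RANK[b[s]])

def likely_dominates(a, b, more_likely):
    """a is strictly preferred to b iff [a >_P b] is more likely than [b >_P a]."""
    return more_likely(better_states(a, b), better_states(b, a))

def uniform_probability(A, B):
    """Comparative probability induced by a uniform distribution on S."""
    return len(A) > len(B)

p = {"s1": 0.4, "s2": 0.35, "s3": 0.25}   # hypothetical non-uniform probabilities

def probability(A, B):
    return sum(p[s] for s in A) > sum(p[s] for s in B)

for relation in (uniform_probability, probability):
    print(likely_dominates(f, g, relation),
          likely_dominates(g, h, relation),
          likely_dominates(h, f, relation))   # True True True: a cycle
```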

This situation can be viewed as a form of the Condorcet paradox in social choice, 
here in the setting of DMU. Indeed the problem of ranking acts can be cast in the 
context of voting. The likely dominance rule is a generalization of the so-called 
pairwise majority rule, whereby states are identified to voters and acts to candidates 
to be voted for. It is well-known that the social preference relation is often not transitive
and may contain cycles, and that its transitivity is impossible under natural
requirements on the voting procedure, such as independence of irrelevant alternatives,
unanimity, and non-dictatorship. Variants of the pairwise majority rule are commonly
found in multicriteria decision-making. However, the likely dominance rule makes
sense for any inclusion-monotonic likelihood relation between events and is much
more general than the pairwise majority rule even in its weighted version.

Assume now that a decision maker supplies a complete preordering of states $\geq_\pi$ in the
form of a possibility distribution $\pi$ on $S$ and a complete preordering of consequences
$\geq_P$ on $X$. Let $\geq_\Pi$ be the induced possibility relation on events [34, 8]. Namely, denoting
by max(A) any (most plausible) state $s \in A$ such that $s \geq_\pi s'$ for any $s' \in A$, $A \geq_\Pi B$
if and only if max(A) $\geq_\pi$ max(B). Define the preference on acts in accordance with
the likely dominance rule, that is, for acts $f$ and $g$: $f \succ g$ iff $[f >_P g] >_\Pi [g >_P f]$; $f \succeq g$
iff $\neg(g \succ f)$. Then, the undesirable intransitivity of the strict preference vanishes.



Example 1 (continued) 

Consider again the 3-state/3-consequence example of Section 3. If a uniform probability
is changed into a uniform possibility distribution, then it is easy to check that
the likely dominance rule yields $f \sim g \sim h$. However, if $s_1 >_\pi s_2 >_\pi s_3$ then
$[f >_P g] = \{s_1, s_2\} >_\Pi [g >_P f] = \{s_3\}$; $[g >_P h] = \{s_1, s_3\} >_\Pi [h >_P g] = \{s_2\}$;
$[f >_P h] = \{s_1\} >_\Pi [h >_P f] = \{s_2, s_3\}$.

So $f \succ g \succ h$ follows. It contrasts with the cycles obtained with a probabilistic approach.
However the indifference relation between acts is not transitive.
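The continuation of Example 1 can be checked in the same way, with the possibility relation taken as: $A >_\Pi B$ iff the most plausible state of $A$ is strictly more plausible than the most plausible state of $B$. The numerical degrees used for $\pi$ are hypothetical; only their ordering is used, and the empty event is given degree 0.

```python
RANK = {"x1": 3, "x2": 2, "x3": 1}   # only the ordering x1 > x2 > x3 matters

f = {"s1": "x1", "s2": "x2", "s3": "x3"}
g = {"s3": "x1", "s1": "x2", "s2": "x3"}
h = {"s2": "x1", "s3": "x2", "s1": "x3"}

def possibilistic_dominates(a, b, pi):
    """Likely dominance with A >_Pi B iff max pi over A > max pi over B."""
    a_wins = max((pi[s] for s in a if RANK[a[s]] > RANK[b[s]]), default=0.0)
    b_wins = max((pi[s] for s in a if RANK[b[s]] > RANK[a[s]]), default=0.0)
    return a_wins > b_wins

uniform = {"s1": 1.0, "s2": 1.0, "s3": 1.0}
ordered = {"s1": 1.0, "s2": 0.5, "s3": 0.25}   # s1 more plausible than s2, then s3

# Uniform possibility: no strict preference between any two of the acts.
print(any(possibilistic_dominates(a, b, uniform)
          for a in (f, g, h) for b in (f, g, h)))          # False

# Ordered states: f > g, g > h and f > h, so no cycle appears.
print(possibilistic_dominates(f, g, ordered),
      possibilistic_dominates(g, h, ordered),
      possibilistic_dominates(f, h, ordered),
      possibilistic_dominates(h, f, ordered))              # True True True False
```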

Let us describe the likely dominance rule induced by a single possibility distribution
(and the possibilistic likelihood relation it induces). If the decision maker is ignorant
about the state of the world, all states are equipossible, and all events but $\emptyset$ are
equally possible as well. So, if $f$ and $g$ are such that $[g >_P f] \neq \emptyset$ and $[f >_P g] \neq \emptyset$
hold, then none of $f \succ g$ and $g \succ f$ hold as per the likely dominance rule. The case
when $[f >_P g] \neq \emptyset$ and $[g >_P f] = \emptyset$ holds is when $f$ Pareto-dominates $g$. Then, the
relation on acts induced by the likely dominance rule reduces to Pareto-dominance.
This method, although totally sound, is not decisive at all (it corresponds to the unanimity
rule in voting theory).

Conversely, if there is an ordering $s_1, \dots, s_n$ of $S$ such that $\pi(s_1) > \pi(s_2) > \dots > \pi(s_n)$,
then for any $A$, $B$ such that $A \cap B = \emptyset$, either $A >_\Pi B$ or $B >_\Pi A$. Hence $\forall f \neq g$,
either $f \succ g$ or $g \succ f$. Moreover this is a lexicographic ranking of vectors $(f(s_1), \dots, f(s_n))$
and $(g(s_1), \dots, g(s_n))$: $f \succ g$ iff $\exists k$ such that $f(s_k) >_P g(s_k)$ and $f(s_i) \sim_P g(s_i)$, $\forall i < k$.
It corresponds to the procedure: check if $f$ is better than $g$ in the most normal
state; if yes prefer $f$; if $f$ and $g$ give equally preferred results in $s_1$, check in the second
most normal state, and so on recursively. It is a form of dictatorship by most plausible
states, in voting theory terms. It also coincides with Boutilier’s criterion (3), except
that ties can be broken by less normal states.
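The coincidence between likely dominance and the lexicographic procedure can be checked exhaustively on a small case. The sketch assumes three states with strictly decreasing (hypothetical) possibility degrees and three consequences with distinct ranks; it compares the two rules on all pairs of acts.

```python
from itertools import product

STATES = ("s1", "s2", "s3")                      # from most to least plausible
PI = {"s1": 1.0, "s2": 0.5, "s3": 0.25}          # hypothetical, strictly decreasing
RANK = {"x1": 3, "x2": 2, "x3": 1}

def likely_dominates(a, b):
    """Likely dominance with A >_Pi B iff max pi over A > max pi over B."""
    a_wins = max((PI[s] for s in STATES if RANK[a[s]] > RANK[b[s]]), default=0.0)
    b_wins = max((PI[s] for s in STATES if RANK[b[s]] > RANK[a[s]]), default=0.0)
    return a_wins > b_wins

def lexicographic_prefers(a, b):
    """Scan states by decreasing plausibility; the first difference decides."""
    for s in STATES:
        if RANK[a[s]] != RANK[b[s]]:
            return RANK[a[s]] > RANK[b[s]]
    return False                                 # same result in every state

# All 27 acts from three states to three consequences.
acts = [dict(zip(STATES, xs)) for xs in product(RANK, repeat=len(STATES))]
rules_agree = all(likely_dominates(a, b) == lexicographic_prefers(a, b)
                  for a in acts for b in acts)
print(len(acts), rules_agree)   # 27 True
```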

More generally any complete preordering splits $S$ into a well-ordered partition
$S_1 \cup S_2 \cup \dots \cup S_k = S$, $S_i \cap S_j = \emptyset$ ($i \neq j$), such that states in each $S_i$ are equally
plausible and states in $S_i$ are more plausible than states in $S_j$, $\forall j > i$. In that case, the
ordering of events is defined as follows: $A >_{\Pi L} B$ if and only if
$\min\{i : S_i \cap A \cap B^c \neq \emptyset\} < \min\{i : S_i \cap B \cap A^c \neq \emptyset\}$, and the decision criterion is a blending of
lexicographic ranking and unanimity among states. Informally, the decision maker proceeds
as follows: $f$ and $g$ are compared on the set of most normal states ($S_1$): if $f$ Pareto-dominates
$g$ in $S_1$, then $f$ is preferred to $g$; if there is a disagreement in $S_1$ about the
relative performance of $f$ and $g$ then $f$ and $g$ are not comparable. If $f$ and $g$ have
equally preferred consequences in each most normal state then the decision maker
considers the set of second most normal states $S_2$, etc. In a nutshell, it is a prioritized
Pareto-dominance relation. Preferred acts are selected by focusing on the most plau- 
sible states of the world, and a unanimity rule is used on these maximally plausible 
states. Ties are broken by lower level oligarchies. So this procedure is similar to 
Boutilier’s decision rule in that it focuses on the most plausible states, except that 
Pareto-dominance is required instead of the Wald criterion, and ties can be broken by 
subsets of lower plausibility. This decision rule is cognitively appealing, but it has a 
limited expressive and decisive power. One may refine Boutilier’s rule using Wald 
criterion instead of Pareto-dominance inside the oligarchies of states. It is also easy to 
imagine a counterpart of the likely dominance rule where expected utility applies 
inside the oligarchies of states [32]. However reasonable these refined decision rules 
may look, they need to be formally justified. 



4 Axiomatics of Qualitative Decision Theory 

A natural question is then whether it is possible to found rational decision making in a
purely qualitative setting, under an act-driven framework à la Savage. The Savage
framework is adapted to our purpose of devising a purely ordinal approach because
its starting point is indeed based on relations and their representation on an interval
scale. Suppose a decision maker supplies a preference relation $\succeq$ over acts $f: S \to X$.
$X^S$ usually denotes the set of all such mappings. In Savage's approach, any mapping
in the set $X^S$ is considered as a possible act (even if it is an imaginary one rather than
a feasible one). The idea of the approach is to extract the decision maker's confidence 
relation and the decision maker's preference on consequences from the decision 
maker's preference pattern on acts. Enforcing "rationality" conditions on the way the 
decision maker should rank acts then determines the kind of uncertainty theory im- 
plicitly "used" by the decision maker for representing the available knowledge on 
states. It also prescribes a decision rule. Moreover, this framework is operationally 
testable, since choices made by individuals can be observed, and the uncertainty the- 
ory at work is determined by these choices. 

As seen in Sects. 2 and 3, two research lines can be followed in agreement with
this definition: the relational approach and the absolute approach. Following the relational
approach, the decision maker's uncertainty is represented by a partial ordering
relation among events (expressing relative likelihood), and the utility function is just
encoded as another ordering relation between potential consequences of decisions.
The advantage is that it is faithful to the kind of elementary information users can
directly provide. The other approach, which can be dubbed the absolute approach [20,
21], presupposes the existence of a totally ordered scale (typically a finite one) for
grading both likelihood and utility. Both approaches lead to an act-driven axiomatization
of the qualitative variant of possibility theory [19].



4.1 The Relational Approach to Decision Theory 

Under the relational approach [10, 12] we try to lay bare the formal consequences of 
adopting a purely ordinal point of view on DMU, while retaining as much as possible 
from Savage's axioms, and especially the sure thing principle which is the cornerstone 
of the theory. However, an axiom of ordinal invariance [22], originally due to 
Fishburn [26] in another context, is added. This axiom stipulates that what matters for 
determining the preference between two acts is the relative positions of consequences 
of acts for each state, not the consequences themselves, nor the positions of these acts 
relative to other acts. More rigorously, two pairs of acts $(f, g)$ and $(f', g')$ such that
$\forall s \in S$, $f(s) \geq_P g(s)$ if and only if $f'(s) \geq_P g'(s)$ are called statewise order-equivalent.
This is denoted $(f, g) \equiv (f', g')$. It means that in each state consequences of $f$, $g$, and of
$f'$, $g'$, are rank-ordered likewise. The Ordinal Invariance axiom [22] is:

OI: $\forall f, f', g, g' \in X^S$, if $(f, g) \equiv (f', g')$ then ($f \succeq g$ iff $f' \succeq g'$),

where "iff" is shorthand for "if and only if". It expresses the purely ordinal nature of
the decision criterion. It is easy to check that the likely dominance rule obeys axiom
OI. This is obvious noticing that if $(f, g) \equiv (f', g')$ then by definition,
$[f >_P g] = \{s, f(s) >_P g(s)\} = [f' >_P g']$. More specifically, under OI, if the weak preference on acts is
reflexive and the induced weak preference on consequences is complete, the only
possible decision rule is likely dominance.

Let $A \subseteq S$ be an event, $f$ and $g$ two acts, and denote by $fAg$ the act such that $fAg(s) = f(s)$
if $s \in A$, and $g(s)$ if $s \notin A$. The set of acts is closed under this combination
involving acts and events. Under the same assumptions, OI ensures the validity of
basic Savage axioms:

S2 (Sure-Thing Principle): $\forall A, f, g, h, h'$, $fAh \succeq gAh$ iff $fAh' \succeq gAh'$

Axiom S2 claims that the relative preference between two acts does not depend on
states where the acts have the same consequences. In other words, the preference
between $fAh$ and $gAh$ does not depend on the choice of $h$. Conditional preference on
a set $A$, denoted $(f \succeq g)_A$, is when $\forall h$, $fAh \succeq gAh$ holds (which under S2 is equivalent
to $\exists h$, $fAh \succeq gAh$). An event $A$ is said to be null if and only if $\forall f, g$, $(f \succeq g)_A$. Any
non-empty set of states A on which all acts make no difference is then like the empty 
set: the reason why all acts make no difference is because this event is considered 
impossible by the decision maker. 
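The combination $fAh$ and axiom S2 can be illustrated by a brute-force check on a small case. The sketch below assumes a possibilistic likely dominance rule with hypothetical plausibility degrees, takes the weak preference as the negation of the reverse strict preference, and enumerates all events and acts over three states and two consequences.

```python
from itertools import product

STATES = ("s1", "s2", "s3")
PI = {"s1": 1.0, "s2": 0.5, "s3": 0.5}   # hypothetical plausibility degrees
RANK = {"x": 1, "y": 0}                   # two consequences, x preferred to y

def compose(f, A, g):
    """The act fAg: it agrees with f on the event A and with g elsewhere."""
    return {s: (f[s] if s in A else g[s]) for s in STATES}

def strictly_prefers(a, b):
    """Likely dominance with A >_Pi B iff max pi over A > max pi over B."""
    a_wins = max((PI[s] for s in STATES if RANK[a[s]] > RANK[b[s]]), default=0.0)
    b_wins = max((PI[s] for s in STATES if RANK[b[s]] > RANK[a[s]]), default=0.0)
    return a_wins > b_wins

def weakly_prefers(a, b):
    """Weak preference taken as: b is not strictly preferred to a."""
    return not strictly_prefers(b, a)

acts = [dict(zip(STATES, xs)) for xs in product(RANK, repeat=len(STATES))]
events = [frozenset(s for s, keep in zip(STATES, mask) if keep)
          for mask in product((0, 1), repeat=len(STATES))]

# S2: the comparison of fAh and gAh does not depend on the common part h.
s2_holds = all(
    weakly_prefers(compose(f, A, h), compose(g, A, h))
    == weakly_prefers(compose(f, A, h2), compose(g, A, h2))
    for A in events for f in acts for g in acts for h in acts for h2 in acts
)
print(s2_holds)   # True
```

The check succeeds because the states where $fAh$ and $gAh$ differ all lie inside $A$, so the events compared by the rule do not involve $h$.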
Among acts in $X^S$ are constant acts such that: $\exists x \in X$, $\forall s \in S$, $f(s) = x$. They are
denoted $f_x$. It seems reasonable to identify the set of constant acts $\{f_x, x \in X\}$ and $X$.
The preference $\geq_P$ on $X$ can be induced from $(X^S, \succeq)$ as follows:

$\forall x, y \in X$, $x \geq_P y$ if and only if $f_x \succeq f_y$. (8)

This definition is self-consistent provided that the preference between constant acts
is not altered by conditioning. This is Savage's third postulate:

S3: $\forall A \subseteq S$, $A$ not null, $f_xAh \succeq f_yAh$, $\forall h$, if and only if $x \geq_P y$.

The preference on acts also induces a likelihood relation among events. For this
purpose, it is enough to consider the set of binary acts, of the form $f_xAf_y$, which due
to (S3) can be denoted $xAy$, where $x \in X$, $y \in X$, and $x >_P y$. Clearly for fixed $x >_P y$,
the set of acts $\{x, y\}^S$ is isomorphic to the set of events $2^S$. However the restriction
of $(X^S, \succeq)$ to $\{x, y\}^S$ may be inconsistent with the restriction to $\{x', y'\}^S$ for other
choices of consequences $x' >_P y'$. A relative likelihood $\geq_L$ among events can however
be recovered, as suggested by Lehmann [32]:

$\forall A, B \subseteq S$, $A \geq_L B$ if and only if $xAy \succeq xBy$, $\forall x, y \in X$ such that $x >_P y$.

In order to get a complete preordering of events, Savage added another postulate:

S4: $\forall x, y, x', y' \in X$ s.t. $x >_P y$, $x' >_P y'$, then $xAy \succeq xBy$ iff $x'Ay' \succeq x'By'$.

Under this property, the choice of $x, y \in X$ with $x >_P y$ does not affect the ordering
between events in terms of binary acts, namely: $A \geq_L B$ is short for $\exists x >_P y$, $xAy \succeq xBy$.

Adopting axiom OI and sticking to a complete and transitive weak preference on
acts (axiom S1 of Savage) leads to problems met in the previous section by the probabilistic
variant of the likely dominance rule. The following result was proved [12]:

Theorem: If $(X^S, \succeq)$ is a complete preordering satisfying axiom OI, and S and X
have at least three non-equally preferred elements, let $>_L$ be the likelihood relation
(induced by S4). Then, there is a permutation of elements of $S$, such that
$s_1 >_L s_2 >_L \dots >_L s_{n-1} \geq_L s_n >_L \emptyset$ and $\forall i = 1, \dots, n - 2$, $s_i >_L \{s_{i+1}, \dots, s_n\}$. Other states $s_i$
such that $i > n$ are impossible ($s_i \sim_L \emptyset$).

If $X$ only has two consequences of distinct values, then such a trivialization is
avoided. Nevertheless in the general case where $X$ has more than two non-equally
preferred elements, the Ordinal Invariance axiom forbids a Savagean decision maker
to believe that there are two equally likely states of the world, each of which is
more likely than a third state. This is clearly not acceptable in practice. If we analyze
the reason why this phenomenon occurs, it is easy to see that the transitivity of
$(X^S, \succeq)$ plays a crucial role, insofar as we wish to keep the sure-thing principle. It implies
the full transitivity of the likelihood relation $\geq_L$. Giving up the transitivity of $\succeq$ suppresses
the unnatural restriction of an almost total ordering of states: we are led to a
weaker condition:

WS1: $(X^S, \succ)$ is a partially ordered set equipped with a transitive, irreflexive relation.

As a consequence, if one insists on sticking to a purely ordinal view of DMU, we
come up with the framework defined by axioms WS1, S3, and OI, plus a non-triviality
axiom (S5: $f > g$ for at least two acts). The likelihood relation induced by S4 is
then orderly (coherent with set inclusion). Moreover, if X has more than two non-equally
preferred elements, it satisfies the following strongly non-probabilistic property:
for any three pairwise disjoint non-null events A, B, C,

$B \cup C >_L A$ and $A \cup C >_L B$ imply $C >_L A \cup B$. (9)

Dropping the transitivity of $\geq$ cancels some useful consequences of the sure-thing
principle under S1, which are nevertheless consistent with the likely dominance rule,
especially the following unanimity axiom [32]:

U: $(f \geq g)_A$ and $(f \geq g)_{A^c}$ imply $f \geq g$.

If this property is added, null events are then all subsets of a subset N of null states.
The likelihood relation can then always be represented by a family of possibility
relations. Namely, there is a family $\mathcal{F}$ of possibility relations on S and a complete
preordering $\geq_P$ on X such that the preference relation on acts is defined by

$f > g$ iff $\forall >_\Pi \in \mathcal{F}$, $[f >_P g] >_\Pi [g >_P f]$. (10)

This ordinal Savagean framework actually leads to a representation of uncertainty 
that is at work in the nonmonotonic logic system of Kraus, Lehmann and Magidor 
[30], as also shown by Friedman and Halpern [27] who study property (9). 
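
Property (9) can be checked by brute force on a small example. The sketch below is only an illustration under stated assumptions: the likelihood relation is taken to be the one induced by a possibility measure (an event is ranked by the largest possibility degree of its states), and the distribution `pi`, the state names and the helper functions are hypothetical. A uniform probability is included for contrast.

```python
from itertools import combinations, product

def nonempty_events(states):
    """All non-empty subsets of the state space, as frozensets."""
    return [frozenset(c) for r in range(1, len(states) + 1)
            for c in combinations(states, r)]

def satisfies_property_9(measure, states):
    """Check: B u C > A and A u C > B imply C > A u B (pairwise disjoint events)."""
    events = nonempty_events(states)
    for a, b, c in product(events, repeat=3):
        if a & b or a & c or b & c:
            continue  # property (9) only concerns pairwise disjoint triples
        if measure(b | c) > measure(a) and measure(a | c) > measure(b):
            if not measure(c) > measure(a | b):
                return False
    return True

# Hypothetical possibility distribution (ordinal levels) on four states.
pi = {"s1": 3, "s2": 2, "s3": 2, "s4": 1}

def possibility(event):
    return max(pi[s] for s in event)

# Uniform probability weights on three states, for contrast.
prob = {"t1": 1, "t2": 1, "t3": 1}

def probability(event):
    return sum(prob[s] for s in event)

print(satisfies_property_9(possibility, list(pi)))   # True
print(satisfies_property_9(probability, list(prob)))  # False
```

The probabilistic counterexample is the triple of singletons: each union of two states is more probable than the third state, yet no single state is more probable than the union of the other two.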

A more general setting starting from a reflexive weak preference relation on acts is 
used in Dubois et al. [13, 14]. In this framework S3 is replaced by a monotonicity 
axiom on both sides, that is implied by Savage's framework, namely for any event A: 



If $[h \geq_P f] = S$ and $f \geq g$ then $fAh \geq g$; if $[g \geq_P h] = S$ and $f \geq g$ then $f \geq gAh$.

The weak likelihood relation $\geq_L$ can be represented by a single possibility relation
if the unanimity property is extended to the disjunction of any two subsets of states, 
and an axiom of anonymity, stating that exchanging the consequences of two equally 
plausible states does not alter the decision maker's preference pattern, is added. Then 
the decision rule described at the end of section 3 is recovered exactly. 

The restricted family of decision rules induced by the purely relational approach to
the decision problem under uncertainty reflects the situation faced in voting theories
where natural axioms lead to impossibility theorems. These results question the very
possibility of a purely ordinal solution to this problem, in the framework of transitive
and complete preference relations on acts. The likely dominance rule lacks discrimi- 
nation, not because of indifference between acts, but because of incomparabilities. 
Actually, it may be possible to weaken axiom OI while avoiding the notion of
certainty equivalent of an uncertain act. It must be stressed that OI requires more than
the simple ordinal nature of preference and uncertainty (i.e. more than separate
ordinal scales for each of them). Condition OI also involves a condition of independence
with respect to irrelevant alternatives. It says that the preference f > g only depends 
on the relative positions of quantities f(s) and g(s) on the preference scale. This un- 
necessary part of the condition could be cancelled within the proposed framework, 
thus leaving room for a new family of rules not considered in this paper, for instance 
involving a third act or some prescribed consequence considered as an aspiration 
level. 



4.2 Qualitative Decision Rules under Commensurateness 

Let us now consider the absolute qualitative criteria (4), (5), (6), based on the Sugeno
integral, in the scope of Savage theory. Clearly, they satisfy S1. However, the sure-thing
principle can be severely violated by the Sugeno integral. It is easy to show that
there may exist f, g, h, h' such that $fAh > gAh$ while $gAh' > fAh'$. It is enough to
consider binary acts (events) and notice that, generally, if A is disjoint from $B \cup C$,
nothing forbids a fuzzy measure $\gamma$ from satisfying $\gamma(B) > \gamma(C)$ along with
$\gamma(A \cup C) > \gamma(A \cup B)$ (for instance, belief functions are such). The possibilistic
criteria (4), (5) violate the sure-thing principle to a lesser extent since:

$\forall A \subseteq S$, $\forall f, g, h, h'$, $W^-_\pi(fAh) > W^-_\pi(gAh)$ implies $W^-_\pi(fAh') \geq W^-_\pi(gAh')$

And likewise for $W^+_\pi$. Moreover, only one part of S3 holds for Sugeno integrals.
The obtained ranking of acts satisfies the following axiom:

WS3: $\forall A \subseteq S$, $\forall f$, $x \geq_P y$ implies $xAf \geq yAf$.

Besides, axiom S4 is violated by Sugeno integrals, but to some extent only.
Namely, $\forall x, y, x', y' \in X$ s.t. $x >_P y$, $x' >_P y'$: $S_\gamma(xAy) > S_\gamma(xBy)$ implies
$S_\gamma(x'Ay') \geq S_\gamma(x'By')$, which forbids preference reversals when changing the pair of
consequences used to model events A and B. Moreover, the strict preference is maintained if the
pair of consequences is changed into more extreme ones: if $x' >_P x >_P y >_P y'$ then
$S_\gamma(xAy) > S_\gamma(xBy)$ implies $S_\gamma(x'Ay') > S_\gamma(x'By')$. The Sugeno integral and its possibilistic
specializations are weakly Pareto-monotonic since $\forall f, g$, $f \geq_P g$ implies $S_\gamma(f) \geq S_\gamma(g)$, but
one may have $f(s) >_P g(s)$ for some state s, while $S_\gamma(f) = S_\gamma(g)$. This is the so-called
drowning effect, which also appears in the violations of S4. This is because some
states are neglected when comparing acts. The basic properties of Sugeno integrals
exploit disjunctive and conjunctive combinations of acts. Namely, given a preference
relation $(X^S, \geq)$, and two acts f and g, define $f \wedge g$ and $f \vee g$ as follows:

$(f \wedge g)(s) = f(s)$ if $g(s) \geq_P f(s)$, and $g(s)$ otherwise

$(f \vee g)(s) = f(s)$ if $f(s) \geq_P g(s)$, and $g(s)$ otherwise

Act $f \wedge g$ always produces the worst consequences of f and g in each state, while
$f \vee g$ always makes the best of them. They are the intersection and union of acts viewed as
fuzzy sets. Obviously $S_\gamma(f \wedge g) \leq \min(S_\gamma(f), S_\gamma(g))$ and $S_\gamma(f \vee g) \geq \max(S_\gamma(f), S_\gamma(g))$ from
weak Pareto monotonicity. These properties hold with equality whenever f or g is a
constant act (or when they are comonotonic). These properties are in fact characteristic
of Sugeno integrals for monotonic aggregation operators [29]. They
can be expressed by means of axioms, called restricted conjunctive and disjunctive
dominance (RCD and RDD), on the preference structure $(X^S, \geq)$:

Axiom RCD: if f is a constant act, $f > h$ and $g > h$ imply $f \wedge g > h$

Axiom RDD: if f is a constant act, $h > f$ and $h > g$ imply $h > f \vee g$
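
The inequalities above, and the equalities for constant acts that underlie RCD and RDD, can be verified exhaustively on a small scale. The following is a minimal sketch, assuming the usual level-cut form of the Sugeno integral, $S_\gamma(f) = \max_\lambda \min(\lambda, \gamma(\{s : u(f(s)) \geq \lambda\}))$; acts are given directly as tuples of utility levels, and the set-function `gamma`, the three-state space and the four-level scale are hypothetical choices made for illustration.

```python
from itertools import product

STATES = (0, 1, 2)
TOP = 3  # ordinal scale L = {0, 1, 2, 3}

def gamma(event):
    """A hypothetical monotonic set-function with gamma(empty) = 0, gamma(S) = TOP."""
    event = frozenset(event)
    if event == frozenset(STATES):
        return TOP
    if {0, 1} <= event:
        return 2
    if 0 in event:
        return 1
    return 0

def sugeno(act):
    """Sugeno integral of an act given as a tuple of utility levels (one per state)."""
    return max(min(lam, gamma(s for s in STATES if act[s] >= lam))
               for lam in range(TOP + 1))

def meet(f, g):
    """Statewise worst consequence of f and g."""
    return tuple(min(a, b) for a, b in zip(f, g))

def join(f, g):
    """Statewise best consequence of f and g."""
    return tuple(max(a, b) for a, b in zip(f, g))

ACTS = list(product(range(TOP + 1), repeat=len(STATES)))

def bounds_hold():
    return all(sugeno(meet(f, g)) <= min(sugeno(f), sugeno(g))
               and sugeno(join(f, g)) >= max(sugeno(f), sugeno(g))
               for f, g in product(ACTS, repeat=2))

def equalities_hold_for_constant_acts():
    for level in range(TOP + 1):
        const = (level,) * len(STATES)
        for g in ACTS:
            if sugeno(meet(const, g)) != min(level, sugeno(g)):
                return False
            if sugeno(join(const, g)) != max(level, sugeno(g)):
                return False
    return True

print(bounds_hold(), equalities_hold_for_constant_acts())
```

With equality for constant acts, RCD follows at once: if the constant f and the act g both rank strictly above h, so does the minimum of their two values.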

For instance, RCD means that upper-bounding the potential utility values of an act
g that is better than another one h by a constant value that is better than the utility of
act h still yields an act better than h. This is in contradiction with expected utility
theory. Indeed, suppose g is a lottery where you win 1000 euros against nothing with
equal chances. Suppose the certainty equivalent of this lottery is 400 euros, received
for sure, and h is the fact of receiving 390 euros for sure. Now, it is likely that, if f
represents the certainty equivalent of g, $f \wedge g$ will be felt strictly less attractive than h
as the former means you win 400 euros against nothing with equal chances. Axiom
RCD implies that such a lottery should always be preferred to receiving $400 - \varepsilon$ euros
for sure, for arbitrarily small values of $\varepsilon$. This axiom is thus strongly counterintuitive in
the context of economic theory, with a continuous consequence set X. However, the
range of validity of qualitative decision theory is precisely when both X and S are
finite. Two presuppositions actually underlie axiom RCD (and similar ones for RDD):

1) There is no compensation effect in the decision process: in case of equal 
chances, winning 1000 euros cannot compensate the possibility of not getting any- 
thing. It fits with the case of one-shot decisions where the notion of certainty equiva- 
lent can never materialize: you can only get 1000 euros or get nothing if you just play 
once. You cannot get 400 euros. The latter can only be obtained in the average, by 
playing several times. 

2) There is a big step between one level $\lambda_j \in L$ in the qualitative value scale and
the next one $\lambda_{j+1}$, with $L = \{1 = \lambda_1 > \dots > 0\}$. The preference pattern $f > h$
always means that f is significantly preferred to h so that the preference level of $f \wedge g$
can never get very close to that of h when $g > h$. The counterexample above is
obtained by precisely moving these two preference levels very close to each other so
that $f \wedge g$ can become less attractive than the sure gain h. Level $\lambda_{j+1}$ is in some sense
negligible in front of $\lambda_j$.

The Sugeno integral can be axiomatized in the style of Savage [20, 21]. Namely, if
the preference structure $(X^S, \geq)$ satisfies S1, WS3, S5, RCD and RDD, then there is a
finite chain of preference levels L, an L-valued monotonic set-function $\gamma$,
and an L-valued utility function u on the set of consequences X, such that the preference
relation on acts is defined by: $f \geq g$ iff $S_\gamma(f) \geq S_\gamma(g)$. Namely, S1, WS3, and
S5 imply Pareto-monotonicity. In the representation method, L is the quotient set
$X^S/\sim$, the utility value $u(x)$ is the equivalence class of the constant act $f_x$, and the
degree of likelihood $\gamma(A)$ is the equivalence class of the binary act $1A0$, having extreme
consequences.

It is easy to check that the equalities $W^-_\pi(f \wedge g) = \min(W^-_\pi(f), W^-_\pi(g))$ and
$W^+_\pi(f \vee g) = \max(W^+_\pi(f), W^+_\pi(g))$ hold with any two acts f and g, for the pessimistic
and the optimistic possibilistic preference functionals respectively. The criterion
$W^-_\pi(f)$ can thus be axiomatized by strengthening the axiom RCD as follows:

Axiom CD: $\forall f, g, h$, $f > h$ and $g > h$ imply $f \wedge g > h$ (conjunctive dominance)

Together with S1, WS3, RDD and S5, CD implies that the set-function $\gamma$ is a
necessity measure and so, $S_\gamma(f) = W^-_\pi(f)$ for some possibility distribution $\pi$. Similarly,
the criterion $W^+_\pi(f)$ can be axiomatized by strengthening the axiom RDD as follows:

Axiom DD: $\forall f, g, h$, $h > f$ and $h > g$ imply $h > f \vee g$ (disjunctive dominance)

Together with S1, WS3, RCD and S5, DD implies that the set-function $\gamma$ is a
possibility measure and so, $S_\gamma(f) = W^+_\pi(f)$ for some possibility distribution $\pi$.
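
The two equalities can be checked numerically, together with an instance of the drowning effect. The sketch below assumes the usual forms of the possibilistic criteria, $W^-_\pi(f) = \min_s \max(n(\pi(s)), u(f(s)))$ and $W^+_\pi(f) = \max_s \min(\pi(s), u(f(s)))$, with n the order-reversing map of the scale; the distribution `PI`, the four-level scale and the acts (given directly as tuples of utility levels) are hypothetical.

```python
from itertools import product

TOP = 3          # ordinal scale L = {0, 1, 2, 3}
PI = (3, 1, 1)   # hypothetical normalized possibility distribution on three states

def n(level):
    """Order-reversing map of the scale."""
    return TOP - level

def w_pes(act, pi=PI):
    """Pessimistic possibilistic criterion."""
    return min(max(n(p), u) for p, u in zip(pi, act))

def w_opt(act, pi=PI):
    """Optimistic possibilistic criterion."""
    return max(min(p, u) for p, u in zip(pi, act))

def meet(f, g):
    return tuple(min(a, b) for a, b in zip(f, g))

def join(f, g):
    return tuple(max(a, b) for a, b in zip(f, g))

ACTS = list(product(range(TOP + 1), repeat=len(PI)))

def decomposability_holds():
    return all(w_pes(meet(f, g)) == min(w_pes(f), w_pes(g))
               and w_opt(join(f, g)) == max(w_opt(f), w_opt(g))
               for f, g in product(ACTS, repeat=2))

print(decomposability_holds())

# Drowning effect: f is at least as good as g everywhere and strictly better
# on the two less plausible states, yet both acts get the same evaluation.
f, g = (1, 3, 3), (1, 0, 0)
print(w_pes(f), w_pes(g))
```

In the last two lines both values equal 1: the poor outcome in the most plausible state determines the evaluation, and the improvement elsewhere is ignored.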

In order to figure out why axiom CD leads to a pessimistic criterion, Dubois, Prade
and Sabbadin [21] have noticed that it can be replaced by the following property:

$\forall A \subseteq S$, $\forall f, g$, $fAg > g$ implies $g \geq gAf$ (11)

This property can be explained as follows: if changing g into f when A occurs results
in a better act, the decision maker has enough confidence in event A to consider
that improving the results on A is worth trying. But in this case there is less confidence
in the complement $A^c$ than in A, and any possible improvement of g when $A^c$
occurs is neglected. Alternatively, the reason why $fAg > g$ holds may be that the
consequences of g when A occurs are very bad and the occurrence of A is not unlikely
enough to neglect them, while the consequences of g when $A^c$ occurs are acceptable.
Then suppose that the consequences of f when A occurs are acceptable as well. Then $fAg > g$.
But act gAf remains undesirable because, even if the consequences of f when $A^c$
occurs are acceptable, act gAf still possesses plausibly bad consequences when A
occurs. So, $g \geq gAf$. For instance, g means losing (A) or winning ($A^c$) 10,000 euros
with equal chances according to whether A occurs or not, and f means winning either
nothing (A) or 20,000 euros ($A^c$) conditioned on the same event. Then fAg is clearly
safer than g as there is no risk of losing money. However, if (11) holds, then the
chance of winning much more money (20,000 euros) by choosing act gAf is neglected
because there is still a good chance of losing 10,000 euros with this lottery. This
behavior is clearly cautious. An optimistic counterpart to (11) can serve as a substitute
for axiom DD for the representation of $W^+_\pi$:

$\forall A \subseteq S$, $\forall f, g$, $g > fAg$ implies $gAf \geq g$ (12)



5 Toward More Efficient Qualitative Decision Rules 

The absolute approach to qualitative decision criteria is simple (especially in the case
of possibility theory). It looks more realistic and flexible than the likely dominance
rule, but it has some shortcomings. First, one has to accept the commensurateness
between utility and degrees of likelihood. It assumes the existence of a common scale
for grading uncertainty and preference. It can be questioned, although it is already
taken for granted in classical decision theory (via the notion of certainty equivalent of
an uncertain event). It is already implicit in Savage's approach, and looks acceptable
for decision under uncertainty (but more debatable in social choice). Of course, the
acts are then totally preordered.

But absolute qualitative criteria lack discrimination due to many indifferent acts. 
They are consistent with Pareto-dominance only in the wide sense. Two acts can be 
considered as indifferent even if one Pareto-dominates the other. The Sure Thing 
principle is violated (even if not drastically for possibilistic criteria). The obtained 
ranking of decisions is bound to be coarse since there cannot be more classes of pref- 
erence-equivalent decisions than levels in the finite scale used. 

Giang and Shenoy [28] have tried to obviate the need for making assumptions on
the pessimistic or optimistic attitude of the decision-maker and to improve the
discrimination power in the absolute qualitative setting by using, as a finite bipolar utility
scale, a totally ordered set of possibility measures on a two-element set {0, 1}
containing the values of the best and the worst consequences. This setting leads to simple,
very natural axioms on possibilistic lotteries. Yet, this criterion has a major drawback:
whenever there are two acts with contrasted consequences (respectively a bad or
neutral one, and a good or neutral one) that have maximal possibility, then these acts
are indifferent.

The drowning effect of the possibilistic criteria can be fixed, in the face of total
ignorance, as done in eq. (7) on the Wald criterion. This criterion can be further refined
by the so-called leximin ordering. The idea is to reorder utility vectors $(u(f(s_1)), \dots, u(f(s_n)))$
by non-decreasing values as $(u(f(s_{\sigma(1)})), \dots, u(f(s_{\sigma(n)})))$, where $\sigma$ is a
permutation such that $u(f(s_{\sigma(1)})) \leq u(f(s_{\sigma(2)})) \leq \dots \leq u(f(s_{\sigma(n)}))$. Let $\tau$ be the
corresponding permutation for an act g. Define the leximin criterion $>_{Leximin}$ as follows:
$f >_{Leximin} g$ iff $\exists k \leq n$ such that $\forall i < k$, $u(f(s_{\sigma(i)})) = u(g(s_{\tau(i)}))$ and
$u(f(s_{\sigma(k)})) > u(g(s_{\tau(k)}))$.

The two possible decisions are indifferent if and only if the corresponding reordered
vectors are the same. The leximin ordering is a refinement of the discrimin
ordering, hence of both the Pareto ordering and the maximin ordering [9]: $f >_D g$
implies $f >_{Leximin} g$. Leximin-optimal decisions are always discrimin-maximal decisions,
and thus indeed min-optimal and Pareto-maximal: $>_{Leximin}$ is the most selective
among these preference relations. Converse implications are not verified. The leximin
ordering can discriminate more than any symmetric aggregation function, since
when, e.g., the sum of the $u(f(s_{\sigma(i)}))$'s equals the sum of the $u(g(s_{\tau(i)}))$'s, it does not mean
that the reordered vectors are the same.
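
The leximin comparison amounts to sorting each utility vector increasingly and comparing the results lexicographically. A minimal sketch follows, with acts given directly as tuples of utility levels; the function names and the sample vectors are illustrative assumptions.

```python
def leximin_prefers(f, g):
    """True iff f is strictly preferred to g in the leximin sense."""
    # Python compares equal-length tuples lexicographically, which is exactly
    # the "first position where the reordered vectors differ" rule.
    return tuple(sorted(f)) > tuple(sorted(g))

def maximin_prefers(f, g):
    """True iff f is strictly preferred to g by the Wald (maximin) criterion."""
    return min(f) > min(g)

f, g = (1, 3, 2), (2, 1, 2)
print(maximin_prefers(f, g), maximin_prefers(g, f))  # both False: the minima tie
print(leximin_prefers(f, g))                          # True: (1, 2, 3) beats (1, 2, 2)
print(leximin_prefers((1, 2, 3), (3, 1, 2)))          # False: same vector once reordered
```

Whenever the maximin criterion strictly prefers f, the first reordered component of f is already larger, so leximin agrees; the converse fails, as the example shows.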



5.1 Additive Refinements of Possibilistic Preference Functionals 

Interestingly, the qualitative leximin rule can be simulated by means of a sum of
utilities provided that the levels in the qualitative utility scale are mapped to values
sufficiently far away from one another on a numerical scale. Consider a finite utility
scale $L = \{\lambda_0 < \dots < \lambda_m\}$. Let $\alpha_i = u(f(s_i))$, and $\beta_i = u(g(s_i))$ in L. Consider an
increasing mapping $\psi$ from L to the reals whereby $a_i = \psi(\alpha_i)$ and $b_i = \psi(\beta_i)$. It is possible to
define this mapping in such a way that

$\min_{i=1,\dots,n} \alpha_i > \min_{i=1,\dots,n} \beta_i$ implies $\sum_{i=1,\dots,n} a_i > \sum_{i=1,\dots,n} b_i$ (12)

Moreover, it can be checked that the leximin ordering comes down to applying the
Bernoulli criterion with respect to a concave utility function $\psi \circ u$:

$f >_{leximin} g$ iff $\sum_{i=1,\dots,n} \psi(u(f(s_i))) > \sum_{i=1,\dots,n} \psi(u(g(s_i)))$.
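
One way to see that such a mapping exists is to build one explicitly. The sketch below uses a hypothetical choice, $\psi(\lambda_j) = -(n+1)^{m-j}$ for n states and top level m, which is not necessarily the mapping intended by the authors: it is increasing and concave, and each step down the scale costs more than anything that can be gained on the other n - 1 states. Acts are again given as tuples of level indices.

```python
from itertools import product

def psi(level, n_states, top):
    """Hypothetical big-stepped encoding of level index `level` in {0, ..., top}."""
    return -(n_states + 1) ** (top - level)

def additive_score(act, top):
    """Sum of encoded utilities (exact integer arithmetic)."""
    return sum(psi(u, len(act), top) for u in act)

def leximin_prefers(f, g):
    return tuple(sorted(f)) > tuple(sorted(g))

def encoding_matches_leximin(n_states, top):
    """Exhaustively compare the additive ranking with the leximin ranking."""
    acts = list(product(range(top + 1), repeat=n_states))
    return all(
        (additive_score(f, top) > additive_score(g, top)) == leximin_prefers(f, g)
        for f in acts for g in acts
    )

print(encoding_matches_leximin(n_states=3, top=2))
print(additive_score((0, 2, 2), top=2), additive_score((1, 1, 1), top=2))
```

For three states and three levels the encoded values are -16, -4 and -1, so (0, 2, 2) scores -18 against -12 for (1, 1, 1): the sum ranks the second act first, as leximin does.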

The optimistic maximax criterion can be refined similarly by a leximax ordering
which can also be simulated by the Bernoulli criterion with respect to a convex utility
function $\phi \circ u$, using an increasing mapping $\phi$ from L to the reals:

$f >_{leximax} g$ iff $\sum_{i=1,\dots,n} \phi(u(f(s_i))) > \sum_{i=1,\dots,n} \phi(u(g(s_i)))$. (13)

The qualitative pessimistic and optimistic criteria under ignorance are thus refined
by means of a classical criterion with respect to a risk-averse and risk-prone utility
function respectively, as can be seen by plotting L against numerical values $\psi(L)$ and
$\phi(L)$. These results have been recently extended to possibilistic qualitative criteria
$W^-_\pi(f)$ and $W^+_\pi(f)$ by Fargier and Sabbadin [25]. They refine possibilistic utilities by
means of weighted averages, thus recovering Savage's first five axioms.

Consider first the optimistic possibilistic criterion $W^+_\pi(f)$ under a possibility
distribution $\pi$. Let $\alpha_i = u(f(s_i))$, and $\beta_i = u(g(s_i))$. We can again define an increasing
mapping $\psi$ from L to the reals such that $\psi(\lambda_m) = 1$ and $\psi(\lambda_0) = 0$, and
$\max_i \min(\pi_i, \alpha_i) > \max_i \min(\pi_i, \beta_i)$ implies $\sum_{i=1,\dots,n} \psi(\pi_i)\cdot\psi(\alpha_i) > \sum_{i=1,\dots,n} \psi(\pi_i)\cdot\psi(\beta_i)$.

A sufficient condition is that if Xj > Xj, \i/(Xj)-\|/(Xj) > \|/(Xj_i)-(\|/(Xj) + m-\|/(X^)). 
The following mapping $\psi$ can be chosen, K being a normalization factor:
$\psi(\lambda_i) = 1/(K n^{2^{m-i}})$, $i = 1, \dots, m$.

Moreover, let $\{S_1, \dots, S_k\}$ be the well-ordered partition of S induced by $\pi$, $S_1$
containing the most plausible states, and $n_i = |S_i|$. Suppose K is chosen so that the
values $\psi(\pi(s))$ sum to 1 over S. Then:

• $\psi \circ \pi$ is a probability assignment p respectful of the possibilistic ordering of
states: p is uniform on equipossible states (the sets $S_i$). Moreover, if $s \in S_i$ then p(s) is
greater than the sum of the probabilities of all less probable states, that is,
$p(s) > P(S_{i+1} \cup \dots \cup S_k)$. Such probabilities generalize the so-called "big-stepped" probabilities
(when the $S_i$ are singletons [2, 37]).

• $\psi \circ \mu$ is a big-stepped numerical utility function that can be encoded by a convex
real mapping.

• $EU^+(f) = \sum_{i=1,\dots,n} \psi(\pi_i)\cdot\psi(u(f(s_i)))$ is an expected (big-stepped) utility for a
risk-seeking decision-maker, and $W^+_\pi(f) > W^+_\pi(g)$ implies $EU^+(f) > EU^+(g)$.
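
The defining property of such a probability assignment (uniform inside each block, each state outweighing all strictly less plausible states together) is easy to realize and to check. The sketch below is a hypothetical construction chosen for simplicity, not the mapping $\psi$ of the text: weights are built from the least plausible block upwards, each weight being one more than the total weight below it, and then normalized with exact fractions.

```python
from fractions import Fraction

def big_stepped_probability(partition):
    """partition: list of blocks of states, most plausible block first."""
    weights = [0] * len(partition)
    tail = 0  # total weight of all strictly less plausible states
    for i in range(len(partition) - 1, -1, -1):
        weights[i] = tail + 1                   # strictly more than everything below
        tail += weights[i] * len(partition[i])  # add the whole block to the tail
    # After the loop, `tail` is the total weight of all states.
    return {s: Fraction(weights[i], tail)
            for i, block in enumerate(partition) for s in block}

def is_big_stepped(p, partition):
    """Check uniformity within blocks and dominance over all lower blocks."""
    for i, block in enumerate(partition):
        lower = sum(p[s] for b in partition[i + 1:] for s in b)
        if any(p[s] != p[block[0]] or p[s] <= lower for s in block):
            return False
    return True

partition = [["a"], ["b", "c"], ["d"]]
p = big_stepped_probability(partition)
print(p["a"], p["b"], p["c"], p["d"])  # 6/11 2/11 2/11 1/11
print(is_big_stepped(p, partition))    # True
```

Here state a alone (6/11) outweighs b, c and d together (5/11), and each of b and c (2/11) outweighs d (1/11), while b and c remain equally probable.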

The pessimistic criterion can be similarly refined. Notice that $W^-_\pi(f; u, \pi) = n(W^+_\pi(f; n \circ u, \pi))$,
using the order-reversing map n of L. Then, choosing the same mapping
$\psi$ as above, one may have $\min_i \max(n(\pi_i), \alpha_i) > \min_i \max(n(\pi_i), \beta_i)$ implies
$\sum_{i=1,\dots,n} \psi(\pi_i)\cdot\phi(\alpha_i) > \sum_{i=1,\dots,n} \psi(\pi_i)\cdot\phi(\beta_i)$, where $\phi(\lambda_i) = \psi(\lambda_m) - \psi \circ n(\lambda_i)$.
$\phi \circ \mu$ is a big-stepped numerical utility function that can be encoded by a concave real mapping,
$EU^-(f) = \sum_{i=1,\dots,n} \psi(\pi_i)\cdot\phi(u(f(s_i)))$ is an expected (big-stepped) utility for a
risk-averse decision-maker, and $W^-_\pi(f) > W^-_\pi(g)$ implies $EU^-(f) > EU^-(g)$.



5.2 Weighted Lexicographic Criteria 

The orderings induced by $EU^+(f)$ and $EU^-(f)$ actually correspond to generalizations
of leximin and leximax to prioritized minimum and maximum aggregations, thus
bridging the gap between possibilistic criteria and classical decision theory. These
generalizations are nested lexicographic structures. Note that leximin and leximax
orderings are defined on sets of tuples whose components belong to a totally ordered
set $(\Omega, \geq)$, say Leximin($\geq$) and Leximax($\geq$). Suppose $\Omega = (L^p, Leximin)$ or $(L^p, Leximax)$.
Then, nested lexicographic ordering relations (Leximin(Leximin($\geq$)),
Leximax(Leximin($\geq$)), Leximin(Leximax($\geq$)), Leximax(Leximax($\geq$))) can be
recursively defined, in order to compare L-valued matrices.

Consider for instance the relation Leximax(Leximin($\geq$)), denoted $\geq_{Lexmaxmin}$, on a set M
of $n \times p$ matrices [a] with coefficients $a_{ij}$ in $(L, \geq)$. M can be totally ordered in a very
refined way by this relation. Denote by $a_{i\cdot}$ row i of [a]. Let [a*] and [b*] be rearranged
matrices [a] and [b] such that terms in each row are reordered increasingly
and rows are arranged lexicographically top-down in decreasing order. $[a] >_{Lexmaxmin} [b]$
is defined by: $\exists k \leq n$ s.t. $\forall i < k$, $a^*_{i\cdot} =_{Leximin} b^*_{i\cdot}$ and $a^*_{k\cdot} >_{Leximin} b^*_{k\cdot}$.

Relation $\geq_{Lexmaxmin}$ is a complete preorder on M. $[a] =_{Lexmaxmin} [b]$ iff both matrices
have the same coefficients up to a rearrangement. Moreover, $>_{Lexmaxmin}$ refines
the optimistic criterion: $\max_i \min_j a_{ij} > \max_i \min_j b_{ij}$ implies $[a] >_{Lexmaxmin} [b]$.
Moreover, if [a] Pareto-dominates [b], then $[a] >_{Lexmaxmin} [b]$.
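
The rearrangement described above translates directly into a sort key. The following sketch, with hypothetical function names and matrices given as lists of rows of integer levels, sorts each row increasingly, arranges the rows in decreasing lexicographic order, and compares the resulting lists lexicographically.

```python
from itertools import product

def lexmaxmin_key(matrix):
    """Rearranged matrix [a*]: rows sorted increasingly, then rows sorted decreasingly."""
    rows = [tuple(sorted(row)) for row in matrix]
    return sorted(rows, reverse=True)

def lexmaxmin_prefers(a, b):
    """True iff [a] is strictly preferred to [b] by Leximax(Leximin(>=))."""
    return lexmaxmin_key(a) > lexmaxmin_key(b)

def maxmin(matrix):
    """Optimistic criterion: best row, each row judged by its worst coefficient."""
    return max(min(row) for row in matrix)

a = [[1, 2], [0, 0]]
b = [[1, 1], [2, 0]]
print(maxmin(a), maxmin(b))        # 1 1: the optimistic criterion cannot separate them
print(lexmaxmin_prefers(a, b))     # True: best row (1, 2) beats best row (1, 1)
print(lexmaxmin_key([[1, 2], [0, 2]]) == lexmaxmin_key([[2, 0], [2, 1]]))  # True
```

The last line illustrates indifference: the two matrices differ only by a permutation of rows and of terms within rows.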

The same consideration applies when refining the minmax ordering by means of
Leximin(Leximax($\geq$)). Comparing acts f and g in the context of a possibility
distribution $\pi$ can be done using relations $\geq_{Lexmaxmin}$ and $\geq_{Lexminmax}$ applied to $n \times 2$ matrices
[f] and [g] on $(L, \geq)$ with coefficients $f_{i1} = \pi_i$ (resp. $n(\pi_i)$) and $f_{i2} = u_i = \mu(f(s_i))$,
$g_{i1} = \pi_i$ (resp. $n(\pi_i)$) and $g_{i2} = v_i = \mu(g(s_i))$. If $\psi$ is the transformation from $W^+_\pi(f)$ to $EU^+(f)$
such that $\psi \circ \pi$ is a big-stepped probability assignment, and $\psi \circ \mu$ and $\phi \circ \mu$ are the big-stepped
utility assignments defined earlier, then it is possible to show that $EU^+(f)$ and $EU^-(f)$
just encode the $\geq_{Lexmaxmin}$ and $\geq_{Lexminmax}$ relations [25]:

Theorem: $[f] \geq_{Lexmaxmin} [g]$ iff $EU^+(f) \geq EU^+(g)$; $[f] \geq_{Lexminmax} [g]$ iff $EU^-(f) \geq EU^-(g)$.

As a consequence, the additive preference functionals $EU^+(f)$ and $EU^-(f)$ refining
the possibilistic criteria are qualitative despite their numerical encoding. Moreover,
the two orderings $\geq_{Lexmaxmin}$ and $\geq_{Lexminmax}$ of acts obey the Savage axioms of rational
decision. These orderings can be viewed as weighted generalizations of leximin and
leximax relations (recovered if $\pi_i = 1$ for all i in case of total ignorance on states).

They are also generalizations of possibility and necessity orderings on events.
Relations $\geq_{Lexmaxmin}$ and $\geq_{Lexminmax}$ between acts coincide if the utility functions are Boolean.
This uncertainty representation is probabilistic, although qualitative, and is a
lexicographic refinement of both possibility and necessity orderings. Suppose there are events A, B
such that $f = \chi_A$, $g = \chi_B$ (indicator functions). Then define $A \geq_{\Pi Lex} B \Leftrightarrow v_A \geq_{leximax} v_B$, where
$v_A$ is the vector $(a_1, \dots, a_n)$ such that $a_i = \pi(s_i)$ if $s_i \in A$; $a_i = 0$ otherwise. This relation
is called "leximax" likelihood. The leximax likelihood relation can be generated
from the well-ordered partition $\{S_1, \dots, S_k\}$ of S induced by $\pi$: $A >_{\Pi Lex} B$ iff there is
an $S_j$ such that $|B \cap (S_1 \cup \dots \cup S_j)| < |A \cap (S_1 \cup \dots \cup S_j)|$ and
$|A \cap (S_1 \cup \dots \cup S_p)| = |B \cap (S_1 \cup \dots \cup S_p)|$ for all $p < j$.
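
Both characterizations of the leximax likelihood can be implemented and compared on a small example. The sketch below assumes a hypothetical distribution `pi` with strictly positive degrees (so that every state belongs to some block of the partition); events are Python sets of state names, and all function names are illustrative.

```python
from itertools import combinations

pi = {"s1": 3, "s2": 2, "s3": 2, "s4": 1}  # hypothetical possibility degrees

def vector(event, dist):
    """The vector v_A: pi(s) for states in the event, 0 elsewhere."""
    return [dist[s] if s in event else 0 for s in dist]

def lex_more_likely(a, b, dist=pi):
    """A is strictly more likely than B: leximax comparison of v_A and v_B."""
    key_a = tuple(sorted(vector(a, dist), reverse=True))
    key_b = tuple(sorted(vector(b, dist), reverse=True))
    return key_a > key_b

def lex_more_likely_by_counting(a, b, dist=pi):
    """Same relation via cardinalities on S_1 u ... u S_j, most plausible blocks first."""
    for level in sorted(set(dist.values()), reverse=True):
        cumulative = {s for s in dist if dist[s] >= level}
        count_a, count_b = len(a & cumulative), len(b & cumulative)
        if count_a != count_b:
            return count_a > count_b
    return False

def all_events(dist):
    states = list(dist)
    return [set(c) for r in range(len(states) + 1) for c in combinations(states, r)]

events = all_events(pi)
print(all(lex_more_likely(a, b) == lex_more_likely_by_counting(a, b)
          for a in events for b in events))
print(lex_more_likely({"s2", "s3"}, {"s2", "s4"}))
```

In the last line both events have the same possibility degree, so a purely possibilistic comparison ties them, whereas the lexicographic refinement prefers the event containing two states of degree 2.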

The relation $\geq_{\Pi Lex}$ is a complete preordering of events whose strict part refines the
possibilistic likelihood $>_{\Pi L}$ of section 3. In fact, $\geq_{\Pi Lex}$ coincides with $\geq_{\Pi L}$ for linear
possibility distributions. For a uniform possibility distribution, $\geq_{\Pi Lex}$ coincides with
the comparative probability relation that is induced by a uniform probability (the
cardinality-based definition gives $A >_{\Pi Lex} B$ iff $|B| < |A|$). This is not surprising in
view of the fact that the leximax likelihood relation is really a comparative probability
relation in the usual sense.

An open problem is to provide a similar lexicographic refinement of the Sugeno
integral. Some preliminary discussions [24] provide some insight on the discrimin
extension of this criterion. There is no hope of refining the Sugeno integral by means of an
expected utility since the former strongly violates the sure-thing principle. More
recent results [15] and the form of equation (6) suggest that it makes sense to refine a
Sugeno integral by means of a Choquet integral with respect to a special kind of belief
function which is a generalization of big-stepped probabilities.



6 Conclusion 

This paper has provided an account of qualitative decision rules under uncertainty. 
They can be useful for solving discrete decision problems involving finite state spaces 
when it is not natural or very difficult to quantify utility functions or probabilities. For
instance, there may be no time to do so because quick advice must be given
(recommender systems). Or the problem takes place in a dynamic environment with a
large state space, a non-quantifiable goal to be pursued, and there is only partial in- 
formation on the current state (autonomous vehicles). Or yet a very high level de- 
scription of a decision problem is available, where states and consequences are 
roughly described (strategic decisions). The possibilistic criteria are compatible with 
dynamic programming algorithms for multiple-stage decision problems [23, 35]. This 
topic, with minor adaptation, is also relevant to multicriteria decision making, where 
the various objectives play the role of the states and the likelihood relation is used to 
compare the relative importance of groups of objectives [11, 13]. 

Two kinds of qualitative decision rules have been found. Some are consistent with 
the Pareto ordering and satisfy the sure thing principle, but leave room to incompara- 
ble decisions, and overfocus on most plausible states. The other ones do rank deci- 
sions, but lack discrimination. It seems that there is a conflict between fine-grained 
discrimination and the requirement of a total ordering in the qualitative setting.
Future work should strive towards exploiting the complementary features of prioritized
Pareto-efficient decision methods such as the likely dominance rule, and of the pessimistic
decision rules related to the Wald criterion. Putting these two requirements together 
seems to lead us back to a very special case of expected utility, as very recent results 
described in the previous section show. The lexicographic criteria so-obtained meet 
requirements for a good qualitative decision theory: no numbers are requested as 
inputs, only a few levels of preference are needed (cognitive relevance), decisiveness 
is ensured (no incomparability), as well as optimal discrimination in accordance with 
Pareto-dominance. 

Abstract. An overview is provided of recent developments in the use
of latent class (LC) models in social science research. Special attention 
is paid to the application of LC analysis as a factor-analytic tool and as 
a tool for random-effects modeling. Furthermore, an extension of the LC 
model to deal with nested data structures is presented. 



1 Introduction 

Latent class (LC) analysis was introduced by Lazarsfeld in 1950 as a way of 
formulating latent attitudinal variables from dichotomous survey items (see [11]). 
During the 1970s, LC methodology was formalized and extended to nominal 
variables by Goodman [6] who also developed the maximum likelihood algorithm 
that has served as the basis for most LC programs. It has, however, taken many
years until the method became a generally accepted tool for statistical analysis.
The history and state of the art of LC analysis in social science research are described
in the recent volume “Applied Latent Class Analysis” edited by Hagenaars and 
McCutcheon [8]. 

Traditionally, LC models were used as clustering and scaling tools for di- 
chotomous indicators. Scaling models, such as the probabilistic Guttman scales, 
involved specification of simple equality constraints on the item conditional prob- 
abilities in order to guarantee that the latent variable would capture a single 
underlying dimension. A more recent development is to parametrize the item
conditional probabilities by means of logit models, yielding restricted variants of LC analysis
which are similar to latent trait models (see [4], [9], and [19]). The log-linear
modeling framework with latent variables implemented in the LEM software
package yields a general class of probability models (graphical models) for
categorical observed and latent variables in which each of the model probabilities
can be restricted by logit constraints (see [7] and [17]). The LEM framework con- 
tains most types of LC models for categorical observed variables as special cases, 
including models with several latent variables, models with covariates, models 
for ordinal variables, models with local dependencies, causal models with latent 
variables, and latent Markov models. 

A closely related field is that of finite mixture (FM) modeling
(see [15]). Traditionally, finite mixture models dealt with continuous outcome
variables. The underlying idea of LC and FM models is, however, the same: the
population consists of a number of subgroups which differ with respect to the
parameters of the statistical model of interest. It is, therefore, not surprising
that in recent years, the fields of LC and FM modeling have come together and
that the terms LC model and FM model have become interchangeable.
For example, mixture model clustering and mixture regression analysis
are now also known as LC clustering and LC regression analysis.

The software package Latent GOLD (see [20] and [21]) implements the most 
important social science application types of LC and FM models - clustering, 
scaling, and random-effects modeling - in three modules: LC cluster, LC factor, 
and LC regression. What is very important for applied researchers is that the
models are implemented in an SPSS-like graphical user interface. The use of LC
analysis for clustering purposes is also well-known outside the social science field. 
LC factor is a factor-analytic tool for discrete or mixed outcome variables (see 
[12]). LC regression makes it possible to take into account unobserved population 
heterogeneity with respect to the coefficients of a regression model (see [23] and 
[24]). In this paper, I will explain the basic ideas underlying the LC factor and 
LC regression models and present several empirical examples. 

LC models are models for two-level data structures. The data consist of a set
of indicators or a set of repeated responses which are nested within individuals.
Recently, models have been proposed for nested data structures consisting of
more than two levels, such as repeated measures nested within persons and
persons nested within groups - teams, countries, or organizations (see [18]). At the
end of this paper, I will pay attention to this hierarchical or multilevel extension
of the LC model and present a procedure called the upward-downward algorithm
that can be used to solve the maximum likelihood estimation problem.

2 The LC Factor Model 

Let us start by introducing some notation. Let $y_{ik}$ denote the realized value of
person i on the kth indicator, item, or response variable. The total number of
response variables is denoted by K. A category of the $\ell$th latent class variable
will be denoted as $x_\ell$, its total number of categories as $T_\ell$, and the total number
of latent variables by L.

The standard LC model that I will refer to as the LC cluster model assumes
that responses are independent of each other given a single latent variable with
$T_1$ unordered categories. The density of $y_i$ is defined as

$f(y_i) = \sum_{x_1=1}^{T_1} P(x_1) \prod_{k=1}^{K} f(y_{ik}|x_1)$,

where the exact form of the class-specific densities $f(y_{ik}|x_1)$ depends on the
scale type of the response variable concerned. The $f(y_{ik}|x_1)$ are taken from the
exponential family.
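
The LC cluster density can be evaluated directly from this formula. The sketch below assumes dichotomous indicators, so that each class-specific density is Bernoulli; the function name, the two-class structure and all parameter values are hypothetical and serve only as an illustration.

```python
from itertools import product
from math import prod

def lc_cluster_density(y, class_probs, item_probs):
    """Mixture density of a response pattern y of 0/1 values.

    class_probs[x]   : P(x), the size of latent class x
    item_probs[x][k] : P(y_k = 1 | x), assumed Bernoulli for every indicator
    """
    total = 0.0
    for p_x, probs in zip(class_probs, item_probs):
        # Local independence: multiply the class-specific item densities.
        conditional = prod(p if y_k == 1 else 1.0 - p for y_k, p in zip(y, probs))
        total += p_x * conditional
    return total

# Hypothetical two-class model with three dichotomous indicators.
class_probs = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],
              [0.2, 0.3, 0.1]]

print(lc_cluster_density((1, 1, 1), class_probs, item_probs))
print(sum(lc_cluster_density(y, class_probs, item_probs)
          for y in product((0, 1), repeat=3)))
```

The first value is 0.6 x 0.504 + 0.4 x 0.006 = 0.3048 up to floating-point rounding, and the densities of the eight possible response patterns sum to 1.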

The main difference between the LC factor and the LC cluster model is that
the former may contain more than one latent variable. Another difference is
that in the factor model the categories of the latent variables are assumed to be
ordered. Thus, rather than working with a single nominal latent variable, here
we work with one or more dichotomous or ordered polytomous latent variables
(Magidson and Vermunt in [12]). The advantage of this approach is that it
guarantees that each of the factors captures no more than one dimension.

The primary difference between our LC factor and the traditional factor anal- 
ysis model is that the latent variables (factors) are assumed to be dichotomous 
or ordinal as opposed to continuous and normally distributed. Because of the 
strong similarity with traditional factor analysis, this approach is called LC fac- 
tor analysis. There is also a strong connection between LC factor models and 
item response or latent trait models. Actually, LC factor models are discretized 
variants of well-known latent trait models for dichotomous and polytomous items 
(see [9], [19], and [22]). 

As in maximum likelihood factor analysis, modeling under the LC factor 
approach can proceed by increasing the number of factors until a good fitting 
model is achieved. This approach to LC modeling provides a general alternative 
to the traditional method of obtaining a good fitting model by increasing the 
number of latent classes. In particular, when working with dichotomous uncor- 
related factors, there is an exact equivalence in the number of parameters of the 
two models. A LC factor model with 1 factor has the same number of parameters 
as a 2-class LC cluster model, a model with 2 factors as a 3-class model, a model 
with 3 factors as a 4-class model, etc. Thus, in an exploratory analysis, rather 
than increasing the number of classes one may instead increase the number of 
factors until an acceptable fit is obtained. 
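The parameter equivalence can be checked with a small counting sketch. The counting rules below are an assumption on my part (the usual count for nominal indicators: class sizes, plus intercepts and loadings or class-specific response probabilities); the category counts are those of the four items listed in Table 1.

```python
def n_params_cluster(n_classes, n_categories):
    """Free parameters of an unrestricted LC cluster model, nominal indicators."""
    class_sizes = n_classes - 1
    response_probs = n_classes * sum(c - 1 for c in n_categories)
    return class_sizes + response_probs

def n_params_factor(n_factors, n_categories):
    """Free parameters with dichotomous, mutually independent factors."""
    class_sizes = n_factors                      # one free proportion per factor
    intercepts = sum(c - 1 for c in n_categories)
    loadings = n_factors * sum(c - 1 for c in n_categories)
    return class_sizes + intercepts + loadings

n_categories = [3, 2, 2, 3]   # Purpose, Accuracy, Understanding, Cooperation
n_cells = 3 * 2 * 2 * 3       # cells in the observed frequency table

for n_factors in (1, 2, 3):
    p_factor = n_params_factor(n_factors, n_categories)
    p_cluster = n_params_cluster(n_factors + 1, n_categories)
    df = n_cells - 1 - p_factor
    print(n_factors, p_factor, p_cluster, df)
```

Under these counting rules the 1-factor and 2-class models both have 13 parameters (22 degrees of freedom), and the 2-factor and 3-class models both have 20 (15 degrees of freedom), which agrees with the df values reported in Sect. 2.3.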

2.1 A Two-Factor Model for Nominal Indicators 

To illustrate the LC factor model, let us assume that we have a two-factor model
for four nominal categorical indicators. The corresponding probability structure
is of the form

\[ P(y_i) = \sum_{x_1=1}^{T_1} \sum_{x_2=1}^{T_2} P(x_1, x_2) \prod_{k=1}^{4} P(y_{ik}|x_1, x_2). \]

The conditional response probabilities $P(y_{ik}|x_1, x_2)$ are restricted by
means of multinomial logit models with linear terms

\[ \eta(y_{ik}|x_1, x_2) = \beta_{0y_k} + \beta_{1y_k} \cdot v_{x_1} + \beta_{2y_k} \cdot v_{x_2}. \]

Because the factors are assumed to be ordinal (or discrete interval) variables,
the two-variable terms are restricted by using fixed category scores for the levels
of the factors. Note that the factors are treated as metric variables, which are,
however, not continuous but discrete. The scores $v_{x_\ell}$ for the categories
of the $\ell$th factor are equidistant scores ranging from 0 to 1. The first level
of a factor gets the score 0 and the last level the score 1. The parameters
describing the strength of relationships between the factors and the indicators
(here, $\beta_{1y_k}$ and $\beta_{2y_k}$) can be interpreted as factor loadings.





Note that the above logit model does not include the three-variable interaction
term of the two factors and the indicator. These higher-order terms are
excluded from the model in order to be able to distinguish the various
dimensions. If we included the three-variable interaction term, our two-factor
model would be equivalent to an unrestricted 4-cluster model. By excluding this
term, we obtain a restricted 4-cluster model in which each of the four clusters
can be conceived as being a combination of two factors.

In the standard LC factor model, the factors are specified to be dichotomous, 
which means that the scoring of the factor levels does not imply a constraint. An 
important extension of this standard model is, however, increasing the number 
of levels of a factor, which makes it possible to describe more precisely the 
distribution of the factor concerned. Note that the levels of the factors remain 
ordered by the use of fixed equal-interval category scores in their relationships 
with the indicators. Therefore, each additional level costs only one degree of 
freedom; that is, there is one additional class size to be estimated. 
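The role of the fixed category scores can be illustrated with a short sketch that turns the linear terms of the logit model into conditional response probabilities for one indicator. The function names and the parameter values are hypothetical; only the equidistant 0-1 scoring and the form of the linear term are taken from the text.

```python
import numpy as np

def category_scores(n_levels):
    """Equidistant scores from 0 (first level) to 1 (last level)."""
    return np.linspace(0.0, 1.0, n_levels)

def response_probs(beta0, beta1, beta2, n_levels1=2, n_levels2=2):
    """P(y_k | x1, x2) from eta = beta0 + beta1 * v_x1 + beta2 * v_x2.

    beta0, beta1, beta2 : arrays of shape (C,), one entry per item category
    Returns an array of shape (n_levels1, n_levels2, C).
    """
    v1 = category_scores(n_levels1)[:, None, None]
    v2 = category_scores(n_levels2)[None, :, None]
    eta = beta0 + beta1 * v1 + beta2 * v2
    eta = eta - eta.max(axis=-1, keepdims=True)   # numerical stability
    expo = np.exp(eta)
    return expo / expo.sum(axis=-1, keepdims=True)

# Hypothetical effect-coded parameters for one three-category item.
beta0 = np.array([0.5, 0.0, -0.5])
beta1 = np.array([-1.0, 0.2, 0.8])
beta2 = np.array([2.0, -0.5, -1.5])

probs = response_probs(beta0, beta1, beta2)               # dichotomous factors
probs_3 = response_probs(beta0, beta1, beta2, n_levels1=3)  # three-level factor 1
```

Moving from two to three levels for the first factor adds a middle score of 0.5 but no new loadings, which is why the extra level only costs the additional class size.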

In the default setting, the factors are assumed to be independent of one
another. This is specified by the appropriate logit constraints on the latent
probabilities. In the two-factor case, this involves restricting the linear term
in the logit model for $P(x_1, x_2)$ by

\[ \eta_{x_1 x_2} = \gamma_{x_1} + \gamma_{x_2}. \]

Working with correlated factors is comparable to performing an oblique rotation.
The association between each pair of factors is described by a single uniform
association parameter:

\[ \eta_{x_1 x_2} = \gamma_{x_1} + \gamma_{x_2} + \gamma_{12} \cdot v_{x_1} \cdot v_{x_2}. \]
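The two specifications of the latent distribution can be compared numerically. The sketch below builds $P(x_1, x_2)$ from the linear term with and without the association parameter; the function name and the values of the one-variable terms are hypothetical.

```python
import numpy as np

def latent_joint(gamma1, gamma2, gamma12=0.0):
    """P(x1, x2) from eta = gamma_x1 + gamma_x2 + gamma12 * v_x1 * v_x2."""
    gamma1 = np.asarray(gamma1, dtype=float)
    gamma2 = np.asarray(gamma2, dtype=float)
    v1 = np.linspace(0.0, 1.0, gamma1.size)   # equidistant 0-1 scores
    v2 = np.linspace(0.0, 1.0, gamma2.size)
    eta = gamma1[:, None] + gamma2[None, :] + gamma12 * np.outer(v1, v2)
    expo = np.exp(eta - eta.max())
    return expo / expo.sum()

# Hypothetical one-variable terms for two dichotomous factors.
independent = latent_joint([0.0, 0.4], [0.0, -0.3])
correlated = latent_joint([0.0, 0.4], [0.0, -0.3], gamma12=1.2)
```

With `gamma12 = 0` the joint distribution factorizes into the product of its margins; with dichotomous factors a non-zero `gamma12` is the log odds ratio between the two factors.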

It should be noted that contrary to traditional factor analysis, the LC factor 
model is identified without additional constraints, such as setting certain factor 
loadings equal to zero. Nevertheless, it is possible to specify models in which 
factor loadings are fixed to zero. Together with the possibility to include factor 
correlation in the model, this option can be used for a confirmatory factor anal- 
ysis. Other extensions are the use of indicators which are ordinal, continuous, 
or counts, the inclusion of local dependencies, and the inclusion of covariates 
affecting the factors. 

Zhang [25] proposed a LC model with several latent variables, called the
hierarchical LC model, that is similar to the LC factor model presented here.
Three important differences are that his factors are nominal instead of ordinal,
that indicators are allowed to be related to only one factor, and that factor
correlations are induced by higher-order factors.



2.2 Graphical Displays 

Magidson and Vermunt [12] proposed a graphical display similar to the one
obtained in correspondence analysis to depict the results of a LC factor analysis.
These displays help in detecting which indicators are related to which factors.
The measures that are displayed are derived from the posterior factor means.
Case $i$'s posterior mean on factor $\ell$ equals

\[ E(v_{i\ell}) = \sum_{x_\ell=1}^{T_\ell} v_{x_\ell} P(x_\ell|y_i). \]



The basic idea is to aggregate these posterior means (factor scores) and plot the
resulting numbers in a two-dimensional display. Note that these numbers will be
in the 0-1 range because the category score $v_{x_\ell}$ is 0 for the lowest
factor level and 1 for the highest level. The most important aggregation is
within categories of the indicators; that is,

\[ E(v_\ell|y_k) = \frac{\sum_{i=1}^{N} E(v_{i\ell}) \, I(y_{ik} = y_k)}{\sum_{i=1}^{N} I(y_{ik} = y_k)}, \]

where $I(y_{ik} = y_k)$ equals 1 if person $i$'s value on indicator $k$ is $y_k$,
and 0 otherwise. This yields the mean of factor $\ell$ for persons who give
response $y_k$ on indicator $k$. These category-specific factor means will be
very different if an indicator is strongly related to a factor.
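The aggregation step can be sketched in a few lines. The posterior probabilities and responses below are made up, and the function names are hypothetical; the code only illustrates the two formulas above for a single factor and a single indicator.

```python
import numpy as np

def posterior_factor_means(posteriors):
    """E(v_il) = sum_x v_x * P(x_l | y_i); posteriors has shape (N, T_l)."""
    scores = np.linspace(0.0, 1.0, posteriors.shape[1])
    return posteriors @ scores

def category_specific_means(factor_means, responses):
    """Average the posterior means within each observed category of one indicator."""
    return {int(c): float(factor_means[responses == c].mean())
            for c in np.unique(responses)}

# Hypothetical posteriors P(x_l | y_i) for five cases on a dichotomous factor.
posteriors = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.3, 0.7],
                       [0.1, 0.9],
                       [0.5, 0.5]])
responses = np.array([0, 0, 1, 1, 0])   # observed category on indicator k

means = category_specific_means(posterior_factor_means(posteriors), responses)
```

In this toy example the two categories end up far apart on the factor (about 0.27 versus 0.80), which is the pattern one would expect for an indicator that is strongly related to the factor.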

Aggregation can be done for any relevant subgroup and not just for categories 
of the indicators. Often it is useful to depict the position of groups formed 
on the basis of socio-demographic characteristics. It is also possible to depict 
the posterior means of individual cases in the plot, yielding what is sometimes 
referred to as a bi-plot. 



2.3 Application: Types of Survey Respondents 

We will now consider an example that illustrates how the LC factor model can 
be used with nominal variables. It is based on the analysis of 4 variables from the 
1982 General Social Survey given by McCutcheon [13] to illustrate how standard 
LC modeling can be used to identify different types of survey respondents. Two 
of the variables ascertain the respondent’s opinion regarding the purpose of 
surveys (Purpose) and how accurate they are (Accuracy), and the others are 
evaluations made by the interviewer of the respondent’s levels of understanding 
of the survey questions (Understanding) and cooperation shown in answering the 
questions (Cooperation). McCutcheon initially assumed the existence of 2 latent 
classes corresponding to ’ideal’ and ’less than ideal’ types. The study included 
separate samples of white and black respondents. Here, I use the data of the 
white respondents only. 

The two-class LC model (or, equivalently, the 1-factor LC model) does not
provide a satisfactory description of this data set ($L^2$ = 75.5; df = 22; p < .001).
Two options for proceeding are to increase the number of classes or to increase
the number of factors. The 2-factor LC model fits very well ($L^2$ = 11.1; df = 15;
p = .75), and also much better than the unrestricted 3-class model ($L^2$ = 22.1;
df = 15; p = .11) that was selected as final model by McCutcheon.

Table 1. Logit parameter estimates for the 2-factor LC model as applied to the GSS’82
respondent-type items

Item            Category               X1       X2
Purpose         good                 -1.12     2.86
                depends               0.26    -0.82
                waste                 0.86     3.68
Accuracy        mostly true          -0.52    -1.32
                not true              0.52     1.32
Understanding   good                 -1.61     0.58
                fair/poor             1.61    -0.58
Cooperation     interested           -2.96    -0.57
                cooperative          -0.60    -0.12
                impatient/hostile     3.56     0.69

Fig. 1. Graphical display of category-specific posterior factor means for the 2-factor
LC model as applied to the GSS’82 respondent-type items



The logit parameter estimates obtained from the 2-factor LC model are given 
in Table 1. These show the magnitude of the relationship between the observed 
variables and the two factors. As can be seen, the interviewers’ evaluations of 
respondents and the respondents’ evaluations of surveys are clearly different 
factors: Understanding and Cooperation are more strongly affected by the first 
factor and Purpose and Accuracy by the second factor. 

Figure 1 depicts the bi-plot containing the category-specific factor means of 
the four indicators. The plot shows even more clearly than the logit coefficients 
that the first dimension differentiates between the categories of Understanding 
and Cooperation and the second between the categories of Purpose and Accuracy.

3 LC Regression Analysis

One of the differences between LC regression analysis and the other forms of 
LC analysis discussed so far is that it concerns a model for a single response 
variable. This response variable is explained by a set of predictors, where the 
predictor effects may take on different values for each latent class (see [10,23,24] 
and Sect. 13.2 in [1]). 

An important feature of LC regression models is that for each case we may
have more than one observation. These multiple observations may be experimental
replications, repeated measurements at different time points or occasions,
clustered observations, or other types of dependent observations. Here, I will use
the term replications, where the replication number will be denoted by $k$. The
value of the response variable for case $i$ at replication $k$ is denoted by
$y_{ik}$. The number of replications, which is not necessarily the same for all
cases, is denoted by $K_i$. Because we are dealing with models with a single
latent variable, we drop the index $\ell$ from $x_\ell$.

Note that I am describing a two-level data structure in which a predictor may
either have the same value or change its value across replications. The former
are the higher-level or level-2 predictors and the latter are lower-level or
level-1 predictors. Here, $k$ indexes the (dependent) lower-level observations
within a certain higher-level observation. Level-1 predictors will be denoted as
$z_{ikp}$ and level-2 predictors as $w_{iq}$. The LC regression model can be used
to define (non-parametric) two-level or random-coefficient models. Using $k$ as an
index for time points or time intervals, one obtains models for longitudinal data,
such as growth or event-history models with non-parametric random coefficients
(see [17] and [23]).

Using the same notation as above, the probability structure underlying the
LC regression model can be defined as

\[ f(y_i|w_i, z_i) = \sum_{x=1}^{T} P(x) \prod_{k=1}^{K_i} f(y_{ik}|x, w_i, z_{ik}). \]

Similarly to other LC models, replications are assumed to be independent given
class membership. For nominal or ordinal dependent variables, the probability
density $f(y_{ik}|x, w_i, z_{ik})$ will usually be assumed to be multinomial; for
continuous variables, univariate normal; and for counts, Poisson or binomial.

The linear predictor in $f(y_{ik}|x, w_i, z_{ik})$ equals

\[ \eta(y_{ik}|x, w_i, z_{ik}) = \beta_{0x} + \sum_{p=1}^{P} \beta_{px} z_{ikp} + \sum_{q=1}^{Q} \beta_{P+q} w_{iq}, \]

where $P$ and $Q$ denote the number of level-1 and level-2 predictors. This
regression model contains a class-specific intercept, $P$ class-specific
regression coefficients, and $Q$ class-independent regression coefficients. The
$P + 1$ coefficients that are class dependent are random coefficients.

The conceptual equivalence between the LC regression model and a two-level
random-coefficient model becomes even clearer if one realizes that it is possible
to compute the means, variances, and covariances of the class-specific coefficients
from the standard LC output. These are obtained by elementary statistics
calculus:

\[ \mu_p = \sum_{x=1}^{T} \beta_{px} P(x), \]

\[ \sigma_{pp'} = \sum_{x=1}^{T} (\beta_{px} - \mu_p)(\beta_{p'x} - \mu_{p'}) P(x). \]

This shows that LC regression analysis results can be summarized to yield
information that is equivalent to the one obtained in regression models with random
coefficients coming from a normal distribution; that is, it is possible to obtain
the mean vector and the covariance matrix of the random coefficients.
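A short sketch of this summary step: the mean and covariance of the class-specific coefficients are weighted moments, with the class sizes as weights. The function name is hypothetical. The numbers are the rounded class sizes and intercepts printed in Table 3; renormalizing the sizes is my own choice, made because the printed values sum to 1.01.

```python
import numpy as np

def coefficient_moments(class_sizes, coefs):
    """Mean vector and covariance matrix of class-specific coefficients.

    class_sizes : shape (T,), class proportions P(x)
    coefs       : shape (T, P), coefficient p in class x
    """
    weights = np.asarray(class_sizes, dtype=float)
    weights = weights / weights.sum()          # guard against rounded sizes
    coefs = np.asarray(coefs, dtype=float)
    mean = weights @ coefs
    centered = coefs - mean
    cov = (centered * weights[:, None]).T @ centered
    return mean, cov

# Rounded class sizes and intercepts as printed in Table 3 (Model IIIc).
sizes = [0.30, 0.28, 0.24, 0.19]
intercepts = [[-0.34], [0.60], [3.33], [1.59]]

mean, cov = coefficient_moments(sizes, intercepts)
std = np.sqrt(np.diag(cov))
```

Up to rounding of the inputs, this gives a mean close to 1.16 and a standard deviation close to 1.38, the values listed for the intercept in the Mean and Std.Dev. columns of Table 3.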



3.1 Application: Longitudinal Study on Attitudes towards Abortion 

In order to demonstrate the non-parametric random-coefficient model, I used
a data set obtained from the data library of the Multilevel Models Project at
the Institute of Education, University of London. The data consist of 264
participants in the 1983 to 1986 yearly waves of the British Social Attitudes
Survey (see [14]). It is a three-level data set: individuals are nested within
constituencies and time points are nested within individuals. I will only make
use of the latter nesting, which means that we are dealing with a standard
repeated measures model. As was shown by Goldstein [5], the highest-level variance
(between constituencies) is so small that it can reasonably be ignored. Below, I
will show how to extend the LC model to deal with higher-level data structures.

The dependent variable is the number of yes responses on seven yes/no questions
as to whether it is a woman’s right to have an abortion under a specific
circumstance. Because this variable is a count with a fixed total, it is most
natural to work with a logit link and binomial error function. Individual-level
predictors in the data set are religion, political preference, gender, age, and
self-assessed social class. In accordance with the results of Goldstein, I found
no significant effects of gender, age, self-assessed social class, and political
preference. Therefore, I did not use these predictors in the further analysis.
The predictors that were used are the level-1 predictor year of measurement
(1=1983; 2=1984; 3=1985; 4=1986) and the level-2 predictor religion (1=Roman
Catholic; 2=Protestant; 3=Other; 4=No religion). Effect coding is used for
nominal predictors.

The LC regression models were estimated by means of version 3.0 of the
Latent GOLD program (see [21]), which also provides the multilevel-type
parameters $\mu$ and $\sigma$. I started with three models without random
effects: an intercept-only model (Ia), a model with a linear effect of year (Ib),
and a model with year dummies (Ic). Models Ib and Ic also contained the nominal
level-2 predictor religion. The test results reported in the first part of
Table 2 show that year and religion have significant effects on the outcome
variable and that it is better to treat year as non-linear. I proceeded by adding
a random intercept. The test results show that the model with 4 classes is the
best one in terms of BIC value. Subsequently, I allowed the time effect to be
class specific. Again, the 4-class model turned out to be the best according to
the BIC criterion.

Table 2. Test results for the estimated models with the attitudes towards abortion
data

Model                                Log-likelihood   # parameters    BIC
No random effects
  Ia.   empty model                      -2309              1         4623
  Ib.   time linear + religion           -2215              5         4458
  Ic.   time dummies + religion          -2188              7         4416
Ic + Random intercept
  IIa.  2 classes                        -1755              9         3560
  IIb.  3 classes                        -1697             11         3456
  IIc.  4 classes                        -1689             13         3451
  IId.  5 classes                        -1689             15         3461
Ic + Random intercept and slope
  IIIa. 2 classes                        -1745             12         3558
  IIIb. 3 classes                        -1683             17         3460
  IIIc. 4 classes                        -1657             22         3436
  IIId. 5 classes                        -1645             27         3441

Table 3. Parameter estimates obtained with Model IIIc for the attitudes towards
abortion data

Parameter               Class 1   Class 2   Class 3   Class 4    Mean   Std.Dev.
Class size                0.30      0.28      0.24      0.19
Intercept                -0.34      0.60      3.33      1.59     1.16     1.38
Time
  1983                    0.14      0.26      0.47     -0.58     0.12     0.35
  1984                   -0.11     -0.46     -0.35     -1.11    -0.45     0.34
  1985                   -0.04     -0.44     -0.26      1.43    -0.10     0.66
  1986                   -0.06      0.64      0.14      0.26     0.24     0.27
Religion
  Catholic               -0.53     -0.53     -0.53     -0.53    -0.53     0.00
  Protestant              0.20      0.20      0.20      0.20     0.20     0.00
  Other                  -0.10     -0.10     -0.10     -0.10    -0.10     0.00
  No Religion             0.42      0.42      0.42      0.42     0.42     0.00
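The BIC comparison in Table 2 can be retraced with a small sketch. The text does not state which BIC formula or sample size was used; the code assumes the common definition BIC = -2 log L + p log N with N = 264 (the number of participants), which reproduces the printed BIC values to within rounding of the log-likelihoods.

```python
import math

def bic(log_likelihood, n_params, n_cases):
    """BIC = -2 log L + (number of parameters) * log(N)."""
    return -2.0 * log_likelihood + n_params * math.log(n_cases)

# (log-likelihood, number of parameters) as printed in Table 2.
models = {
    "IIa": (-1755, 9), "IIb": (-1697, 11), "IIc": (-1689, 13), "IId": (-1689, 15),
    "IIIa": (-1745, 12), "IIIb": (-1683, 17), "IIIc": (-1657, 22), "IIId": (-1645, 27),
}
n_cases = 264  # assumption: the number of participants is used as N

bic_values = {name: bic(ll, p, n_cases) for name, (ll, p) in models.items()}
best_intercept_only = min(("IIa", "IIb", "IIc", "IId"), key=bic_values.get)
best_overall = min(bic_values, key=bic_values.get)
```

Under this assumption the 4-class model has the lowest BIC both among the random-intercept models (IIc) and among the models with a class-specific time effect (IIIc), in line with the model choice described above.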

Table 3 reports the parameter estimates for Model IIIc. The means indicate
that the attitudes are most positive at the last time point and most negative
at the second time point. Furthermore, the effects of religion show that people
without religion are most in favor and Roman Catholics and Others are most
against abortion. Protestants have a position that is close to the no-religion
group.

As can be seen, the 4 latent classes have very different intercepts and time
patterns. Class 1, the largest class, is most against abortion and class 3 is
most in favor of abortion. Both latent classes are very stable over time. The
overall level of latent class 2 is somewhat higher than that of class 1, and it
shows somewhat more change of the attitude over time. People belonging to latent
class 4 are very unstable: at the first two time points they are similar to
class 2, at the third time point to class 3, and at the last time point again to
class 2 (this can be seen by combining the intercepts with the time effects).
Class 4 could therefore be labelled as random responders. It is interesting to
note that in a three-class solution the random-responder class and class 2 are
combined. Thus, by going from a three- to a four-class solution one identifies
the interesting group with less stable attitudes.

3.2 Application: Choice-Based Conjoint Study

The LC regression model is a popular tool for the analysis of data from conjoint
experiments in which individuals rate separate products or choose between sets of
products having different attributes (see [10]). The objective is to determine the
effect of product characteristics on the rating or the choice probabilities or,
more technically, to estimate the utilities of product attributes. LC analysis is
used to identify market segments for which these utilities differ. The
class-specific utilities can be used to estimate the market share of possible new
products; that is, to simulate future markets.

For illustration of LC analysis of data obtained from choice-based conjoint
experiments, I will use a generated data set. The products are 12 pairs of shoes
that differ on 3 attributes: Fashion (0=traditional, 1=modern), Quality (0=low,
1=high), and Price (ranging from 1 to 5). Eight choice sets offer 3 of the 12
possible alternative products to 400 individuals. Each choice task consists of
indicating which of the three alternatives they would purchase, with the response
“none of the above” allowed as a fourth choice option.

The regression model that is used is a multinomial logit model with
choice-specific predictors, also referred to as the conditional logit model. The
BIC values indicated that the three-class model is the model that should be
preferred. The parameter estimates obtained with the 3-class model are reported
in Table 4. As can be seen, Fashion has a major influence on choice for class 1,
Quality for class 2, and both Fashion and Quality affect the choice for class 3.
The small differences in price effect across the three classes turned out to be
insignificant.

Table 4. Estimates of the parameters for 3-class choice model

Parameter     Class 1   Class 2   Class 3    Mean   Std.Dev.
Class size      0.51      0.26      0.24
Fashion         3.03     -0.17      1.20     1.77     1.37
Quality        -0.09      2.72      1.12     0.92     1.16
Price          -0.39     -0.36     -0.56    -0.42     0.08
None            1.29      0.19     -0.43     0.60     0.73

In addition to the conditional logit model, which shows how the attributes
affect the likelihood of choosing one alternative over another, differentially for
each class, I specified a second logit model to describe the latent class variable
as a function of the covariates sex and age. Females turn out to belong more often
to class 1 and males to class 3. Younger persons have a higher probability of
belonging to class 1 (emphasize Fashion in choices) and older persons are most
likely to belong to class 2 (emphasize Quality in choices).
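To illustrate how class-specific utilities translate into choice probabilities and simulated market shares, the following sketch applies a conditional logit to one hypothetical choice set. The utilities and class sizes are the rounded values printed in Table 4. Treating Price as a linear numeric attribute and using the "None" parameter as the utility of the no-purchase option are assumptions of this sketch, as are the choice set and all names in the code.

```python
import numpy as np

# Class sizes and utilities as printed in Table 4 (rows: classes 1 to 3).
class_sizes = np.array([0.51, 0.26, 0.24])
# Columns: Fashion, Quality, Price.
attr_utils = np.array([[3.03, -0.09, -0.39],
                       [-0.17, 2.72, -0.36],
                       [1.20, 1.12, -0.56]])
none_utils = np.array([1.29, 0.19, -0.43])

def choice_probs(alternatives):
    """Class-specific choice probabilities; the last column is 'none'."""
    alternatives = np.asarray(alternatives, dtype=float)   # shape (A, 3)
    utils = attr_utils @ alternatives.T                    # shape (classes, A)
    utils = np.column_stack([utils, none_utils])
    expo = np.exp(utils - utils.max(axis=1, keepdims=True))
    return expo / expo.sum(axis=1, keepdims=True)

# Hypothetical choice set: (Fashion, Quality, Price) per alternative.
choice_set = [(1, 0, 3), (0, 1, 3), (1, 1, 5)]

per_class = choice_probs(choice_set)
weights = class_sizes / class_sizes.sum()   # printed sizes sum to 1.01
market_shares = weights @ per_class
```

In this example class 1 most often picks the modern low-quality shoe and class 2 the traditional high-quality shoe, which mirrors the interpretation of the two classes given above; `market_shares` averages the class-specific probabilities over the segments.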



4 LC Models for Nested Data Structures 



As explained in the context of the LC regression model, LC analysis is a technique 
for analyzing two-level data structures. In most cases, this will be repeated mea- 
sures or item responses that are nested within individuals. Here, I will present a 
three-level extension of the LC model and discuss the complications in parameter 
estimation, as well as indicate how these complications can be resolved. 

Before proceeding, some additional notation has to be introduced. Let $y_{ijk}$
denote the response of individual $j$ within group $i$ on indicator or item $k$.
The number of groups is denoted by $N$, the number of individuals within group
$i$ by $n_i$, and the number of items by $K$. The latent class variable at the
individual level is denoted as $x_j$. For reasons that will be clear below, I will
use the index $j$ in $x_j$ when referring to the latent class membership of a
certain individual within a group.

The standard method for analyzing such grouped data structures is the
multiple-group LC model (see [3]). A multiple-group LC model with group-specific
class sizes would be of the form

\[ P(y_i) = \prod_{j=1}^{n_i} \sum_{x_j=1}^{T} \left[ \prod_{k=1}^{K} P(y_{ijk}|x_j) \right] P(x_j|i). \]

As can be seen, observations within a group are assumed to be independent of
each other given the group-specific latent distribution $P(x_j|i)$.

A disadvantage of this “fixed-effects” approach is that the number of unknown
parameters increases rapidly as the number of groups increases. An alternative
is to assume that groups belong to latent classes of groups, denoted by $w$,
that differ with respect to the latent distribution of individuals. This yields
a LC model of the form

\[ P(y_i) = \sum_{w=1}^{M} \left\{ \prod_{j=1}^{n_i} \sum_{x_j=1}^{T} \left[ \prod_{k=1}^{K} P(y_{ijk}|x_j) \right] P(x_j|w) \right\} P(w). \]
This model can be represented as a graphical model containing one latent
variable at the group level and one latent variable for each individual within a
group. The fact that the model contains so many latent variables makes the use of
a standard EM algorithm for maximum likelihood estimation impractical.

The contribution of group $i$ to the completed data log-likelihood that has to
be maximized in the M step of the EM algorithm has the form

\[
\begin{aligned}
\log L_i = {} & \sum_{w=1}^{M} \sum_{x_j=1}^{T} \sum_{j=1}^{n_i} \sum_{k=1}^{K} P(x_j, w|y_i) \log P(y_{ijk}|x_j) \\
 & + \sum_{w=1}^{M} \sum_{x_j=1}^{T} \sum_{j=1}^{n_i} P(x_j, w|y_i) \log P(x_j|w) \\
 & + \sum_{w=1}^{M} P(w|y_i) \log P(w).
\end{aligned}
\]
This shows that the “only” quantities that have to be obtained in the E step of
the EM algorithm are the $T \cdot M$ marginal posteriors $P(x_j, w|y_i)$ for each
individual within a group. It turns out that these can be obtained in an efficient
manner by making use of the conditional independence assumptions implied by the
underlying graphical model. More precisely, the new algorithm makes use of the
fact that lower-level observations are independent of each other given the
higher-level class memberships. The underlying idea of using the structure of the
model of interest for the implementation of the EM algorithm is similar to what is
done in hidden Markov models. For these models, Baum et al. in [2] developed an
efficient EM algorithm which is known as the forward-backward algorithm because it
moves forward and backward through the Markov chain. Vermunt in [18] called the
version of EM for the new LC model the upward-downward algorithm because
it moves upward and downward through the hierarchical structure: First, one
marginalizes over class memberships going from the lower to the higher levels.
Subsequently, the relevant marginal posterior probabilities are computed going
from the higher to the lower levels. The method can easily be generalized to data
structures consisting of more than three levels. Moreover, it can be used not only
in LC cluster-like applications, but also in the context of LC regression analysis.

The upward-downward algorithm makes use of the fact that

\[ P(x_j, w|y_i) = P(w|y_i) P(x_j|y_i, w) = P(w|y_i) P(x_j|y_{ij}, w); \]

that is, that given class membership of the group ($w$), class membership of the
individuals ($x_j$) is independent of the information of the other group members.
The terms $P(w|y_i)$ and $P(x_j|y_{ij}, w)$ are obtained with the model parameters:

\[ P(x_j|y_{ij}, w) = \frac{P(x_j, y_{ij}|w)}{P(y_{ij}|w)} = \frac{P(x_j|w) \prod_{k=1}^{K} P(y_{ijk}|x_j)}{P(y_{ij}|w)}, \]

\[ P(w|y_i) = \frac{P(w) \prod_{j=1}^{n_i} P(y_{ij}|w)}{\sum_{w'=1}^{M} P(w') \prod_{j=1}^{n_i} P(y_{ij}|w')}, \]

where $P(y_{ij}|w) = \sum_{x_j=1}^{T} P(x_j|w) \prod_{k=1}^{K} P(y_{ijk}|x_j)$. In
the upward part, we compute $P(x_j, y_{ij}|w)$ for each individual, collapse these
over $x_j$ to obtain $P(y_{ij}|w)$, and use these to obtain $P(w|y_i)$ for each
group. The downward part involves computing $P(x_j, w|y_i)$ for each individual
using $P(w|y_i)$ and $P(x_j|y_{ij}, w)$.

In the upward-downward algorithm, computation time increases linearly with
the number of individuals within groups instead of exponentially, as would be
the case in a standard E step. Computation time can be decreased somewhat
more by grouping records with the same values for the observed variables within
groups. A practical problem in the implementation of the upward-downward
method is that underflows may occur in the computation of $P(w|y_i)$. More
precisely, because it may involve multiplication of a large number
($1 + n_i \cdot K$) of probabilities, the term
$P(w) \prod_{j=1}^{n_i} P(y_{ij}|w)$ may become equal to zero for each $w$. Such
underflows can, however, easily be prevented by working on a log scale. Letting
$a_{iw} = \log[P(w)] + \sum_{j=1}^{n_i} \log[P(y_{ij}|w)]$ and
$b_i = \max(a_{iw})$, $P(w|y_i)$ can be obtained as follows:

\[ P(w|y_i) = \frac{\exp[a_{iw} - b_i]}{\sum_{w'=1}^{M} \exp[a_{iw'} - b_i]}. \]


4.1 Application: Team Differences in Perceived Task Variety 

In a Dutch study on the effect of autonomous teams on individual work condi- 
tions, data were collected from 41 teams of two organizations, a nursing home 
and a domiciliary care organization. These teams contained 886 employees. For 
the example, I took five dichotomized items of a scale measuring perceived task 
variety (see [16]). The item wording is as follows (translated from Dutch): 

1. Do you always do the same things in your work? 

2. Does your work require creativity? 

3. Is your work diverse? 

4. Does your work make enough usage of your skills and capacities? 

5. Is there enough variation in your work? 

The original items contained four answer categories. In order to simplify the
analysis, I collapsed the first two and the last two categories. Because some
respondents had missing values on one or more of the indicators, the estimation
procedure was adapted to deal with such partially observed indicators.

The fact that this data set is analyzed by means of LC analysis means that it
is assumed that the researcher is interested in building a typology of employees
based on their perceived task variety. On the other hand, if one were interested
in constructing a continuous scale, a latent trait analysis would be more
appropriate. Of course, also in that situation the multilevel structure should be
taken into account.

Table 5 reports the log-likelihood value, the number of parameters, and the
BIC value for the models that were estimated. I first estimated models without
taking the group structure into account. The BIC values for the one- to
three-class models (Models I-III) without a random latent class distribution show
that a solution with two classes suffices. Subsequently, I introduced
group-specific latent distributions in the two-class model (Models IV and V). From
the results obtained with these two models, it can be seen that there is clear
evidence for between-team variation in the latent distribution: these models have
much lower BIC values than the two-class model without group-specific class sizes.
The model with three classes of groups (Model V) has almost the same log-likelihood
value as Model IV, which indicates that no more than two latent classes of teams
can be identified.

Table 5. Test results for the estimated models with the task variety data

Model   Individuals   Groups      Log-likelihood   # parameters    BIC
I       1 class       1 class         -2685              5         5405
II      2 classes     1 class         -2385             11         4844
III     3 classes     1 class         -2375             16         4859
IV      2 classes     2 classes       -2367             13         4822
V       2 classes     3 classes       -2366             15         4835

The conditional response probabilities obtained with Model IV indicated that
the first class has a much lower probability of giving the high task-variety
response than class two on each of the five indicators. The two classes of team
members can therefore be named “low task-variety” and “high task-variety”.
The two classes of teams contained 37 and 63 percent of the teams. The
proportions of team members belonging to the high task-variety class are .41 and
.78, respectively. This means, for instance, that in the majority of teams (63%)
the majority of individuals (78%) belong to the high task-variety group. The
substantive conclusion based on Model IV would be that there are two types
of employees and two types of teams. The two types of teams differ considerably
with respect to the distribution of the team members over the two types of
employees.



Abstract. Several known procedures transforming an imprecise probability
into a precise one focus on special classes of imprecise probabilities,
like belief functions and 2-monotone capacities, while not addressing
the more general case of coherent imprecise probabilities, as defined by
Walley. In this paper we first analyze some of these transformations,
exploring the possibility of applying them to more general families of
uncertainty measures and evidencing their limitations. In particular, the
pignistic probability transformation is investigated from this perspective.
We then propose a transformation that can be applied to coherent imprecise
probabilities, discussing its properties and the way it can be used
in the case of partial assessments.



1 Introduction 

The study of transformation procedures between uncertainty representation for- 
malisms attracts the interest of researchers for several reasons. Firstly, transform- 
ing a more complex representation into a simpler one is often computationally 
advantageous. Further, these procedures are a key issue for enabling
interoperability among heterogeneous uncertain reasoning systems, adopting
different approaches to uncertainty, as is the case in multi-agent systems [1].

In this paper we focus on procedures for transforming an imprecise prob- 
ability, defined on a finite set of events, into a precise one. Among the main 
choices underlying the definition of such a procedure we mention the class of 
imprecise probabilities to which the procedure is applicable and the criteria 
that define desirable transformation properties. As to the former point, we are 
aware of transformation proposals regarding some classes of imprecise probabil- 
ities, namely possibilities, belief functions and 2-monotone capacities. As to the 
latter, several criteria have been considered in literature proposals, as we will 
discuss in Sect. 3. 

To our knowledge, no existing proposal addresses the problem of defining a
transformation procedure from a coherent imprecise probability [2] into a precise
one. This problem is more general in two main respects:

— on one hand, coherent imprecise probabilities cover a larger set of uncertainty
assignments and include as special cases the subclasses mentioned above [3];

— on the other hand, a coherent imprecise probability can be assigned on an
arbitrary set of events, while other formalisms require an algebraic structure.

In this paper, after recalling some basic concepts in Sect. 2, we first discuss
several transformation procedures in Sect. 3, investigating to what extent they
can be generalized to coherent imprecise probabilities. In particular we discuss
Voorbraak’s transformation in Sect. 3.1, the pignistic probability transformation
in Sect. 3.2, and uncertainty invariant transformations in Sect. 3.3. In Sect. 4 we
provide some empirical results about the relevance of specific subclasses within
coherent imprecise probabilities and show the need for a new
procedure, which is then defined and discussed, considering also the case of
partial assessments. Section 5 concludes the paper.

2 Basic Concepts 

We use the symbol $\Omega$ to denote both the certain event and a finite set (also
called universal set or partition) of pairwise disjoint (non-impossible) events
whose union is the certain event: $\Omega = \{\omega_1, \ldots, \omega_N\}$. Then
$\omega_1, \ldots, \omega_N$ are called atoms, $2^\Omega$ is the powerset of $\Omega$, and $|A|$ is the
cardinality of $A \in 2^\Omega$, i.e. the number of distinct atoms of $\Omega$ whose union is $A$.

A mapping $C : 2^\Omega \to [0, 1]$ is a (normalized) capacity [4] whenever:

$C(\emptyset) = 0$; $C(\Omega) = 1$; $C(A_1) \le C(A_2)$, $\forall A_1, A_2 \in 2^\Omega$ such that $A_1 \subset A_2$.

A capacity is 2-monotone iff $\forall A, B \in 2^\Omega$, $C(A \cup B) \ge C(A) + C(B) - C(A \cap B)$.
Belief functions [5], and in particular necessity measures [6] and, when defined
on $2^\Omega$, finitely additive probabilities, are special cases of 2-monotone capacities.
Their definitions are well known; we shall recall in Proposition 1 their
characterizations in terms of their Möbius inverses [4].

For any $f : 2^\Omega \to \mathbb{R}$, there is a one-to-one correspondence between $f$ and its
Möbius inverse or mass function $m : 2^\Omega \to \mathbb{R}$, given $\forall A \in 2^\Omega$ by ([4], [7]):

$m(A) = \sum_{B \subset A} (-1)^{|A \setminus B|} f(B)$,    $f(A) = \sum_{B \subset A} m(B)$    (1)

The events $A \in 2^\Omega$ such that $m(A) \ne 0$ are called focal elements.
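The two formulas in (1) can be tried out numerically. The following is a minimal sketch, not part of the original paper: it assumes events are encoded as bitmasks over the atoms (bit i stands for atom i), reads $B \subset A$ as including $B = A$, and uses the hypothetical helper names `submasks`, `mobius_inverse` and `from_mass`.

```python
from fractions import Fraction


def popcount(x):
    """Number of atoms in the event encoded by bitmask x."""
    return bin(x).count("1")


def submasks(mask):
    """Yield every subset of `mask` (as a bitmask), including 0 and mask."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask


def mobius_inverse(f):
    """First formula in (1): m(A) = sum over B in A of (-1)^|A minus B| f(B)."""
    return [sum((-1) ** popcount(A ^ B) * f[B] for B in submasks(A))
            for A in range(len(f))]


def from_mass(m):
    """Second formula in (1): f(A) = sum over B in A of m(B)."""
    return [sum(m[B] for B in submasks(A)) for A in range(len(m))]


# Illustrative set function on two atoms, indexed by bitmask:
# f[0b00] = f(empty), f[0b01] = f(w1), f[0b10] = f(w2), f[0b11] = f(Omega).
f = [Fraction(0), Fraction(1, 5), Fraction(3, 10), Fraction(1)]
m = mobius_inverse(f)
focal = [A for A in range(len(m)) if m[A] != 0]
print(m, focal)
```

Exact fractions are used so that the round trip from $f$ to $m$ and back can be compared without rounding error.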

Proposition 1. Given $f : 2^\Omega \to \mathbb{R}$, let $m$ be its Möbius inverse. Then

(a) $f$ is a capacity iff $m$ is such that: $m(\emptyset) = 0$; $\sum_{B \in 2^\Omega} m(B) = 1$;
$\sum_{\omega \in B \subset A} m(B) \ge 0$, $\forall A \in 2^\Omega$, $\forall \omega \in A$.

Further, if $f$ is a capacity and $\mathcal{F}$ the set of its focal elements, then

(b) $f$ is a 2-monotone capacity iff $\forall A, B \in 2^\Omega$, $\sum_{C \subset A \cup B,\ C \not\subset A,\ C \not\subset B} m(C) \ge 0$;

(c) $f$ is a belief function iff $m$ is non-negative;

(d) $f$ is a necessity measure iff $\mathcal{F}$ is totally ordered by relation ‘$\subset$’;

(e) $f$ is a (precise) probability iff $(A \in \mathcal{F}) \Rightarrow (A$ atom of $\Omega)$.

Note that if $f$ is a capacity, $f(\omega) = m(\omega) \ge 0$, $\forall \omega \in \Omega$.

The conjugate $C'$ of a capacity $C$ is defined by $C'(A) = 1 - C(A^c)$, $\forall A \in 2^\Omega$.
The conjugate of a 2-monotone capacity (a belief function, a necessity measure)
is termed 2-alternating capacity (plausibility function, possibility). A precise
probability is self-conjugate.
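The characterizations (b) to (e) of Proposition 1 and the definition of the conjugate translate almost literally into code. The sketch below is only an illustration under stated assumptions: events are bitmasks over three atoms, the input is assumed to be a capacity already, and all function names and the two example capacities are hypothetical choices, not taken from the paper.

```python
from fractions import Fraction


def submasks(mask):
    """Yield every subset of `mask` (as a bitmask), including 0 and mask."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask


def mobius_inverse(f):
    """Mass function of f, following formula (1)."""
    return [sum((-1) ** bin(A ^ B).count("1") * f[B] for B in submasks(A))
            for A in range(len(f))]


def is_two_monotone(m):
    """Proposition 1(b): masses of the C inside A|B but in neither A nor B."""
    n = len(m)
    return all(
        sum(m[C] for C in submasks(A | B) if (C & ~A) and (C & ~B)) >= 0
        for A in range(n) for B in range(n))


def is_belief(m):
    """Proposition 1(c): non-negative masses."""
    return all(x >= 0 for x in m)


def is_necessity(m):
    """Proposition 1(d): focal elements form a chain under inclusion."""
    focal = sorted((A for A in range(len(m)) if m[A] != 0),
                   key=lambda A: bin(A).count("1"))
    return all((X & Y) == X for X, Y in zip(focal, focal[1:]))


def is_probability(m):
    """Proposition 1(e): every focal element is an atom."""
    return all(bin(A).count("1") == 1 for A in range(len(m)) if m[A] != 0)


def conjugate(f):
    """C'(A) = 1 - C(complement of A)."""
    full = len(f) - 1
    return [1 - f[full ^ A] for A in range(len(f))]


F = Fraction
# Nested focal elements {w1}, {w1,w2}, Omega: a necessity measure.
nec = [F(0), F(1, 2), F(0), F(4, 5), F(0), F(1, 2), F(0), F(1)]
# 0 on atoms, 1/2 on pairs, 1 on Omega: 2-monotone with a negative mass.
cap = [F(0), F(0), F(0), F(1, 2), F(0), F(1, 2), F(1, 2), F(1)]
m_nec, m_cap = mobius_inverse(nec), mobius_inverse(cap)
print(is_necessity(m_nec), is_belief(m_cap), is_two_monotone(m_cap))
```

The second capacity shows that the inclusion of belief functions in 2-monotone capacities is strict: its mass on $\Omega$ is negative, yet condition (b) holds.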

The notion of coherent lower probability^1 specializes that of (coherent) lower
prevision and is defined in [2], sec. 2.5, referring to an arbitrary (finite or not,
structured or not) set of events $\mathcal{S}$. We assume here that $\mathcal{S} \subset 2^\Omega$.

Coherent lower probabilities are indirectly characterized as lower envelopes on
$\mathcal{S}$ of precise probabilities. Denoting with $\mathcal{M}$ the set of all precise probabilities
dominating $\underline{P}$ on $\mathcal{S}$ (i.e. $P \in \mathcal{M}$ iff $P(A) \ge \underline{P}(A)$, $\forall A \in \mathcal{S}$), the following lower
envelope theorem holds [2]:

Proposition 2. $\underline{P}$ is a coherent lower probability on $\mathcal{S}$ iff there exists a
(non-empty) set $\mathcal{D} \subset \mathcal{M}$ such that $\underline{P}(A) = \inf_{P \in \mathcal{D}} \{P(A)\}$, $\forall A \in \mathcal{S}$ (inf is attained).

One may refer to either lower ($\underline{P}$) or upper ($\overline{P}$) probabilities only, exploiting
conjugacy. In particular, assessing both $\underline{P}(A)$ and $\overline{P}(A)$ is equivalent to assessing
$\underline{P}(A)$ and $\underline{P}(A^c) = 1 - \overline{P}(A)$. When $\underline{P}(A) = \overline{P}(A) = P(A)$, $\forall A \in \mathcal{S}$, $P$ is a
coherent precise probability on $\mathcal{S}$ (a finitely additive probability if $\mathcal{S} = 2^\Omega$).

Later, we shall use the following necessary condition for coherence:

$\underline{P}(A \cup B) \ge \underline{P}(A) + \underline{P}(B)$, $\forall A, B : A \cap B = \emptyset$ $(A, B, A \cup B \in \mathcal{S})$    (2)

When $\mathcal{S} = 2^\Omega$, lower (upper) probabilities are formally capacities, and
include as special cases 2-monotone (2-alternating) capacities, therefore also belief
functions and necessity measures (their conjugates).

Later in this paper we shall be concerned with transforming an imprecise
(lower or upper) probability assessment on $\mathcal{S}$ into a precise probability. Since $\mathcal{S}$
will be finite but arbitrary, coherent imprecise and precise probabilities will be
used. A relevant question will be what a given lower probability assessment
$\underline{P}$ on $\mathcal{S}$ entails on other events, not belonging to $\mathcal{S}$. Concerning this problem, it is
known that coherent lower probabilities can always be coherently extended on
any $\mathcal{S}' \supset \mathcal{S}$. In particular, when considering one additional event $A$, the set of all
coherent extensions of $\underline{P}$ to $A$ is a closed (non-empty) interval $[\underline{P}_E(A), \underline{P}_U(A)]$,
where $\underline{P}_E(A)$ is the natural extension [2] of $\underline{P}$ to $A$. That is, $\underline{P}_E(A)$ is the
least committal or vaguest admissible coherent extension of $\underline{P}$ on $A$. It may be
obtained as the infimum value on $A$ of all precise probabilities dominating $\underline{P}$ on
$\mathcal{S}$, which requires solving a linear programming (LP) problem. In principle, other
coherent extensions may be interesting, especially the upper extension $\underline{P}_U(A)$,
which has the opposite meaning of least vague coherent extension of $\underline{P}$. However,
as shown in [8], sec. 5, computing $\underline{P}_U(A)$ may require solving $|\mathcal{S}|$ distinct LP
problems. The natural extension has also another advantage: if we compute
separately $\underline{P}_E(A_i)$ for $i = 1, \ldots, m$ (i.e. we find separately the natural extension
of $\underline{P}$ on $\mathcal{S}_i = \mathcal{S} \cup \{A_i\}$), then $\{\underline{P}_E(A_1), \ldots, \underline{P}_E(A_m)\}$ is the natural extension of $\underline{P}$
on $\mathcal{S} \cup \{A_1, \ldots, A_m\}$. This is generally not true for any other coherent extension:
for instance, if we compute $\underline{P}_U(A_1)$ and put $\underline{P}(A_1) = \underline{P}_U(A_1)$, then the upper
extension $\underline{P}'_U(A_2)$ of $\underline{P}$ from $\mathcal{S} \cup \{A_1\}$ to $\mathcal{S} \cup \{A_1, A_2\}$ will be generally less than
the value $\underline{P}_U(A_2)$ representing the upper extension of $\underline{P}$ from $\mathcal{S}$ to $\mathcal{S} \cup \{A_2\}$.

^1 We shall often omit the term coherent in the sequel.
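For a finite $\Omega$ with $N$ atoms, the LP behind the natural extension can be written out explicitly. The formulation below is a sketch derived from the description above (the infimum on $A$ over all precise probabilities dominating the assessment on the given set of events); the variables $p_i$, standing for the probabilities of the atoms $\omega_i$, are notation introduced here and not taken from the paper.

```latex
\[
\underline{P}_E(A) \;=\; \min_{p_1,\ldots,p_N} \; \sum_{i:\,\omega_i \subset A} p_i
\quad \text{subject to} \quad
\sum_{i:\,\omega_i \subset B} p_i \;\ge\; \underline{P}(B) \;\; \forall B \in \mathcal{S},
\qquad \sum_{i=1}^{N} p_i = 1,
\qquad p_i \ge 0 \;\; (i = 1,\ldots,N).
\]
```

The feasible region is the set of dominating precise probabilities, which is non-empty exactly when the assessment avoids sure loss, so for a coherent assessment the minimum always exists.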

3 An Analysis of Previously Proposed Transformations

Firstly, we briefly recall different transformation criteria considered in the past.

Consistency criteria impose some logical relation between the transformed
uncertainty assignment $U_2$ and the original assignment $U_1$. In particular, when
transforming a lower probability $\underline{P}$ into a precise probability $P$ the commonly
adopted consistency criterion [9] is applied, requiring that

$P(A) \ge \underline{P}(A)$, $\forall A \in 2^\Omega$    (3)

Similarity criteria aim at minimizing some difference between $U_2$ and $U_1$; in
particular two kinds of similarity can be considered:

— ordinal similarity: some credibility order induced on events by $U_1$ should be
preserved by $U_2$; an example is the (strict) preference preservation principle:
$U_1(A) > U_1(B) \Rightarrow U_2(A) > U_2(B)$, $\forall A, B \in 2^\Omega$;

— quantitative similarity: some distance between $U_1$ and $U_2$ should be
minimized.

Selection criteria determine the selection of a particular $U_2$ when there are
multiple results compatible with other transformation criteria.



3.1 Voorbraak’s Bayesian Transformation 



Voorbraak’s Bayesian transformation takes as input the mass function $m$ of a
given belief function and outputs a mass function $m_V$ which, in [10], is defined
to be zero for all non-atomic events of $2^\Omega$, while for every atom $\omega \in \Omega$ it is:

$m_V(\omega) = \frac{\sum_{B \supset \omega} m(B)}{\sum_{C \subset \Omega} m(C) \cdot |C|}$    (4)

Such a mass assignment corresponds to a precise probability $P_V(\omega) = m_V(\omega)$,
$\forall \omega \in \Omega$ (cf. Sect. 2). We note that this still holds if $m$ is the mass function of
a capacity, since again $m_V$ is non-negative, by Proposition 1(a). The
transformation can be equivalently rewritten in terms of the initial plausibility values of
atoms: $P_V(\omega) = m_V(\omega) = Pl(\omega) / \sum_{\omega' \in \Omega} Pl(\omega')$.

In other words, Voorbraak’s proposal is a normalization of the plausibility
values of atoms of $\Omega$ (more generally, when applied to a capacity $C$, it normalizes
the values on the atoms of its conjugate $C'$). As such, it clearly respects an ordinal
similarity criterion restricted to the plausibility values of the atoms of $\Omega$. This
is a relatively weak property; in particular, the transformation does not satisfy
the consistency criterion (3), as pointed out e.g. in [11].

In [12] a method for approximating belief functions based on the concept
of fuzzy T-preorder is proposed. A detailed review of this approach cannot be
carried out here due to space limitations. However this work is mainly focused on
possibility distributions approximating a belief function, while the existence and
uniqueness of a probability which approximates the belief function is guaranteed
only for a specific choice of the T-norm to be used in the transformation. In
this case, as shown in [12], the probability obtained coincides with Voorbraak’s,
therefore the same considerations can be applied.
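As a small numerical illustration of (4), the sketch below applies Voorbraak’s transformation to a mass function on three atoms. The bitmask encoding of events, the function name `voorbraak` and the example masses are assumptions made for this illustration only.

```python
from fractions import Fraction


def voorbraak(m, n_atoms):
    """Eq. (4): mass of all events containing the atom, normalised by
    the sum of m(C) * |C| over all events C."""
    denom = sum(m[C] * bin(C).count("1") for C in range(len(m)))
    return [sum(m[B] for B in range(len(m)) if B >> i & 1) / denom
            for i in range(n_atoms)]


# Illustrative belief function on Omega = {w1, w2, w3} (bit i = atom i+1):
# m({w1}) = 1/2, m({w1, w2}) = 3/10, m(Omega) = 1/5.
m = [Fraction(0)] * 8
m[0b001] = Fraction(1, 2)
m[0b011] = Fraction(3, 10)
m[0b111] = Fraction(1, 5)

p_v = voorbraak(m, 3)
print(p_v)
```

The plausibilities of the three atoms are 1, 1/2 and 1/5 here, so the result is just these values divided by their sum 17/10, in line with the normalization reading of the transformation.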



3.2 Pignistic Probability 



The pignistic probability transformation (PPT) has been considered in a variety
of publications mainly concerning belief functions (e.g. [11], [13]). It takes as
input a mass function $m$ and produces a probability $P_{pign}$ on the atoms of $\Omega$:

$P_{pign}(\omega) = m_{pign}(\omega) = \sum_{A \supset \omega} \frac{m(A)}{|A|}$, $\forall \omega \in \Omega$    (5)

This transformation is based on the principle of insufficient reason applied
to the focal events: it distributes uniformly their masses over their atoms.
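Formula (5) is short enough to be checked by hand on a small case. The following sketch assumes a bitmask encoding of events and an illustrative mass function on three atoms; neither the function name `pignistic` nor the numbers come from the paper.

```python
from fractions import Fraction


def pignistic(m, n_atoms):
    """Eq. (5): each mass m(A) is split evenly among the |A| atoms of A."""
    return [sum(m[A] / bin(A).count("1")
                for A in range(1, len(m)) if A >> i & 1)
            for i in range(n_atoms)]


# Illustrative belief function on Omega = {w1, w2, w3} (bit i = atom i+1):
# m({w1}) = 1/2, m({w1, w2}) = 3/10, m(Omega) = 1/5.
m = [Fraction(0)] * 8
m[0b001] = Fraction(1, 2)
m[0b011] = Fraction(3, 10)
m[0b111] = Fraction(1, 5)

p_pign = pignistic(m, 3)
print(p_pign)
```

For instance, the first atom receives all of the mass 1/2, half of 3/10 and a third of 1/5, i.e. 43/60.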

Assuming now that $m$ is the Möbius inverse of a capacity $C$, it is easy to see
(using (1), (5) and additivity of $P_{pign}$) that

$P_{pign}(B) = C(B) + \sum_{A \not\subset B} \frac{|A \cap B|}{|A|} m(A)$, $\forall B \in 2^\Omega$    (6)

It is clear from (6) and Proposition 1(c) that PPT preserves the consistency
criterion when $C$ is a belief function. Now the point is: what if $C$ is not a belief
function? It is not even immediate that $P_{pign}$ is then a probability, since
negative masses appear in (5). The question was first addressed in [13], where $P_{pign}$
was also derived from axioms for combining credibility functions, i.e. capacities
with some additional axiomatic properties. We extend this result by proving
in Proposition 4 that PPT returns a precise probability from any capacity.
Preliminarily we summarize some results stated in [4].

Proposition 3. Given a capacity $C : 2^\Omega \to \mathbb{R}$, an associated set of precise
probabilities $\mathcal{V}$ can be defined as follows. Let $\mathbb{S}$ be the set of all permutations of
the atoms of $\Omega$, with $|\Omega| = N$. Given $S \in \mathbb{S}$, $S = (\omega_{i_1}, \ldots, \omega_{i_N})$, define $S_0 = \emptyset$
and, for $n = 1, \ldots, N$, $S_n = \omega_{i_1} \cup \ldots \cup \omega_{i_n}$, $P_S(\omega_{i_n}) = C(S_n) - C(S_{n-1})$. Let
$\mathcal{V} = \{P_S : S \in \mathbb{S}\}$. Then

(a) Every $P_S$ is a (precise) probability, and it is, $\forall \omega \in \Omega$,

$P_S(\omega) = \sum_{A \supset \omega} \lambda(A, \omega)\, m(A)$    (7)

where $\lambda(A, \omega_S(A)) = 1$, $\omega_S(A)$ being the last element of $A$ in permutation $S$,
and $\lambda(A, \omega) = 0$, $\forall \omega \ne \omega_S(A)$.

(b) The set $\mathcal{V}$ coincides with the set of the vertices of the set $\mathcal{M}$ of all precise
probabilities dominating $C$ if and only if $C$ is 2-monotone.



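Proposition 4. Let $m$ be the mass function corresponding to a capacity $C$. Then
$P_{pign}$ as defined in (5) is a precise probability.

Proof. Using (7), we have that $\sum_{S \in \mathbb{S}} P_S(\omega)$, and consequently
$\frac{1}{N!} \sum_{S \in \mathbb{S}} P_S(\omega)$, are a weighted sum of the masses of events $A$ such that
$A \supset \omega$. The weight of a generic $m(A)$ in $\sum_{S \in \mathbb{S}} P_S(\omega)$ is equal to the
cardinality of the set $\{S \mid S \in \mathbb{S}, \omega = \omega_S(A)\}$, i.e. to the number of times
$\lambda(A, \omega) = 1$. The cardinality of this set is actually $\frac{N!}{|A|}$, since any $\omega \in A$ has the
same chance of being the last element of $A$ in a given permutation $S$. Therefore

$\frac{1}{N!} \sum_{S \in \mathbb{S}} P_S(\omega) = \frac{1}{N!} \sum_{A \supset \omega} \frac{N!}{|A|} m(A) = P_{pign}(\omega)$, $\forall \omega \in \Omega$    (8)

From (8) and Proposition 3(a), $P_{pign}$ is a convex combination of precise
probabilities, and is therefore itself a precise probability. □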
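Identity (8) can be checked by brute force on a small capacity. The sketch below builds every $P_S$ of Proposition 3 by walking through the permutations, averages them, and compares the result with formula (5). The capacity used is an illustrative one chosen so that the mass of $\Omega$ is negative; the bitmask encoding and all names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import permutations

N = 3
# Capacity indexed by bitmask (bit i = atom i): C(w1)=1/10, C(w2)=0,
# C(w1 u w2)=1/2, C(w3)=1/5, C(w1 u w3)=2/5, C(w2 u w3)=1/2, C(Omega)=1.
C = [Fraction(x) for x in ("0", "1/10", "0", "1/2", "1/5", "2/5", "1/2", "1")]


def popcount(x):
    return bin(x).count("1")


def submasks(mask):
    """Yield every subset of `mask` (as a bitmask), including 0 and mask."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask


# Moebius inverse of C, formula (1).
m = [sum((-1) ** popcount(A ^ B) * C[B] for B in submasks(A))
     for A in range(1 << N)]


def p_s(order):
    """P_S of Proposition 3 for the permutation `order` of atom indices."""
    p, mask = [None] * N, 0
    for i in order:
        p[i] = C[mask | 1 << i] - C[mask]   # C(S_n) - C(S_{n-1})
        mask |= 1 << i
    return p


all_ps = [p_s(order) for order in permutations(range(N))]
average = [sum(p[i] for p in all_ps) / len(all_ps) for i in range(N)]
pign = [sum(m[A] / popcount(A) for A in range(1, 1 << N) if A >> i & 1)
        for i in range(N)]
print(average, pign)
```

Even though one mass is negative, the average of the $N! = 6$ probabilities $P_S$ and the value of (5) agree atom by atom, as (8) states.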

We answer now a further question concerning how PPT relates to the
consistency criterion.

Proposition 5. Let $C : 2^\Omega \to \mathbb{R}$ be a given capacity with mass function $m$, and
$P_{pign}$ be defined by (5).

(a) If $C$ is 2-monotone, $P_{pign}(A) \ge C(A)$, $\forall A \in 2^\Omega$.

(b) If $C$ is not 2-monotone, $P_{pign}$ may or may not dominate $C$. If $\underline{P}$ is a coherent
lower probability, $P_{pign}$ dominates $\underline{P}$ on the atoms of $\Omega$, i.e. $P_{pign}(\omega) \ge \underline{P}(\omega)$,
$\forall \omega \in \Omega$, but not necessarily elsewhere.

Proof. To prove (a), observe that if $C$ is 2-monotone the probabilities $P_S$ are
vertices of $\mathcal{M}$ by Proposition 3(b), and as such dominate $C$; then also their
convex combination $P_{pign}$ dominates $C$ (see (8)).

If $C$ is not 2-monotone, examples may be found where $P_{pign}$ does not
dominate $C$. Consider for instance the coherent lower probability $\underline{P}$ on $2^\Omega$, with
$\Omega = \{a, b, c, d, e\}$, which is the lower envelope of the three precise probabilities
$P_1$, $P_2$, $P_3$, determined by assigning, in order, the following values to the atoms
$a, b, c, d, e$: $P_1$ - values [0.49, 0.35, 0.12, 0.01, 0.03], $P_2$ - values
[0.14, 0.03, 0.07, 0.36, 0.40], $P_3$ - values [0.36, 0.05, 0.29, 0.14, 0.16]. Then $P_{pign}$
is given by the following values on the atoms $a, b, c, d, e$:
[0.31983, 0.16316, 0.1423, 0.1723, 0.2023]. We have that
$\underline{P}(a \vee d) = 0.5 > P_{pign}(a \vee d) = 0.49216$.

To prove the second part of (b), let $\omega \in \Omega$. Considering a permutation $S$
(Proposition 3), let $r$ be the position of $\omega$ in $S$, i.e. $\omega_{i_r} = \omega$. Then $P_S(\omega) =
P_S(\omega_{i_r}) = \underline{P}(\omega_{i_1} \cup \ldots \cup \omega_{i_r}) - \underline{P}(\omega_{i_1} \cup \ldots \cup \omega_{i_{r-1}}) \ge \underline{P}(\omega_{i_r}) = \underline{P}(\omega)$, using (2)
in the inequality. Since $S$ is arbitrary, $P_S(\omega) \ge \underline{P}(\omega)$, $\forall S$. Hence also the convex
combination $P_{pign}$ of the probabilities $P_S$ is such that $P_{pign}(\omega) \ge \underline{P}(\omega)$. □

We may conclude from Proposition 5 that PPT preserves the consistency
criterion as far as it is applied to 2-monotone capacities. It may not preserve
it outside 2-monotonicity, even though consistency may at least partially hold,
as demonstrated in (b). To get some empirical insight into the behaviour of PPT
outside 2-monotonicity, we randomly generated a large number of coherent lower
probabilities which were not 2-monotone (see Sect. 4), computing also their
corresponding $P_{pign}$: the percentage of non-dominating $P_{pign}$ was relatively low,
and the dominance violations were numerically rather small. One
of the examples we found is used in the proof of Proposition 5.
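The counterexample in the proof of Proposition 5 can be reproduced in a few lines. The sketch below takes the three probabilities $P_1$, $P_2$, $P_3$ from the proof, builds their lower envelope on all events, and applies (1) and (5). The bitmask encoding (bit 0 for atom a, up to bit 4 for atom e) and the variable names are assumptions of this sketch; floating-point arithmetic is used, so the values agree with the reported ones only up to rounding.

```python
ATOMS = "abcde"
N = len(ATOMS)
P = [
    [0.49, 0.35, 0.12, 0.01, 0.03],   # P_1
    [0.14, 0.03, 0.07, 0.36, 0.40],   # P_2
    [0.36, 0.05, 0.29, 0.14, 0.16],   # P_3
]


def popcount(x):
    return bin(x).count("1")


def submasks(mask):
    """Yield every subset of `mask` (as a bitmask), including 0 and mask."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask


def lower(A):
    """Lower envelope of P_1, P_2, P_3 on the event with bitmask A."""
    return min(sum(p[i] for i in range(N) if A >> i & 1) for p in P)


low = [lower(A) for A in range(1 << N)]

# Moebius inverse (1) of the lower probability, then pignistic formula (5).
m = [sum((-1) ** popcount(A ^ B) * low[B] for B in submasks(A))
     for A in range(1 << N)]
pign = [sum(m[A] / popcount(A) for A in range(1, 1 << N) if A >> i & 1)
        for i in range(N)]

a_or_d = 0b01001                      # event a v d
print([round(x, 5) for x in pign])
print(low[a_or_d], pign[0] + pign[3])
```

On the event $a \vee d$ all three probabilities give 0.5, so the lower probability is 0.5, while the pignistic value stays slightly below it; on the single atoms dominance holds, as the second part of (b) requires.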

The pignistic probability has also been interpreted (e.g. in [9], which focuses
on possibility measures) as center of gravity of the vertices of $\mathcal{M}$. This
interpretation is clearly supported by Proposition 3(b) and (8), from which it is also
evident that its validity is limited to 2-monotone probabilities. However the
interpretation seems to us debatable even in the context of 2-monotonicity. In
fact, from (8), $P_{pign}$ is the average of $N!$ probabilities $P_S$, which are not
necessarily all distinct: distinct permutations in Proposition 3 may well originate the
same probability (examples are easily found). This means that $P_{pign}$ is actually
a weighted average of the distinct $P_S$, and the weight of any $P_S$ is given by the
number of distinct permutations $S$ which give rise to it (analogously, the vertices
obtained in [9] by means of ‘selection functions’ are also not necessarily distinct).

A point which seems therefore difficult to justify in the center of gravity
interpretation is the real meaning of the weights in terms of the initially assigned
2-monotone capacity. More generally, it is also questionable, when transforming
a given uncertainty measure $\mu$, to use a criterion which is based on properties of
different uncertainty measures, that only indirectly relate to $\mu$.

3.3 Uncertainty Invariant Transformations 

In [14] the uncertainty invariance principle is proposed, which states that un- 
certainty transformations should not modify the information contents of a given 
assignment. To apply this principle a measure of information has to be defined 
for the uncertainty measures involved in the transformation process. Several 
proposals to extend the classical Hartley measure for crisp sets and Shannon 
entropy for precise probabilities to fuzzy sets and belief functions are considered 
in [14]. Those based on mass values, interpreted as ‘degrees of evidence’, can not 
be extended to the case of imprecise probabilities, where masses can be negative 
and have no clear intuitive interpretation [15]. On the other hand, the definition 
of the aggregate uncertainty measure (AU), namely the maximum value of the 
Shannon entropy among all probability distributions dominating a given belief 
function, could be directly extended to coherent imprecise probabilities. Some 
limitations of AU are pointed out in [16]. In particular, AU does not distin- 
guish among all imprecise probabilities which are consistent with the uniform 
probability, including the case of total ignorance. 

A transformation based on the uncertainty invariance principle and using the
AU measure consists in determining the maximum entropy precise probability
among those in $\mathcal{M}$. This is equivalent to adopting the consistency criterion with
maximum entropy as a selection criterion, but is intrinsically in contrast with
similarity and in particular with preference preservation, tending to equate all
probability values of atoms, as far as allowed by consistency. In our opinion,
this is a drawback of the criterion in transformations devoted to uncertainty
interchange. A discussion of pros and cons of maximum entropy methods may
be found in [2], sec. 5.12, where it appears that these methods may be appropriate
in certain specific decision problems.

Apart from theoretical issues, to our knowledge no algorithm has been
devised for computing the maximum entropy probability $P_{me}$ consistent with
an imprecise probability assessment. In the case of 2-monotone capacities, it
is shown in [17] that the solution is unique and an algorithm is provided. We
checked that the procedure does not ensure the consistency condition (3)
outside 2-monotonicity. Consider for this the lower probability assignment $\underline{P}$ in
the proof of Proposition 5(b). The resulting $P_{me}$, determined by the values
[0.26, 0.16, 0.16, 0.16, 0.26] on the atoms of $\Omega$, is such that $P_{me}(a \cup d) = 0.42 <
\underline{P}(a \cup d) = 0.5$.



4 An Imprecise to Precise Probability Transformation 

Even though there are important situations which cannot be adequately
described by 2-monotonicity ([2], sec. 5.13.4), it is also known that 2-monotonicity
arises in certain contexts, for instance when using pari-mutuel models (at
racetracks or in life insurance) [2], or more generally convex transformations of
precise probabilities [18]. As shown in the previous section, some transformation
procedures preserve their applicability or some important properties only in the
context of 2-monotone capacities. One may then wonder whether these
procedures should be applied to imprecise probabilities too, assuming that a ‘large’
part of them is 2-monotone. To give an empirical answer to this question, we
wrote a computer program in Java whose main loop randomly generates
a coherent imprecise probability and verifies whether it is a belief function
and, if not, whether it is 2-monotone. All random selections are carried out using
the standard Java function Math.random, which generates pseudorandom
numbers in $[0, 1)$ with an approximately uniform distribution (see documentation at
http://java.sun.com/j2se/1.4.1/docs/api/java/util/Random.html for details).

The program was run for $|\Omega| = 3, \ldots, 10$ and generated 100000 imprecise
probabilities for each cardinality. The results are shown in Table 1.

As can be noted, the relevance of 2-monotone capacities rapidly decreases
as $|\Omega|$ increases. Moreover, for $|\Omega| > 5$, the number of 2-monotone capacities
which are not belief functions is extremely small (11 out of 874 for $|\Omega| = 10$).
These results suggest that, in general, 2-monotone capacities cannot be
considered a numerically adequate representative of coherent lower probabilities.
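The kind of experiment described above can be sketched as follows. The text does not say how the coherent lower probabilities were generated, so this sketch makes its own assumption: each one is the lower envelope of three random precise probabilities, which is coherent by the lower envelope theorem (Proposition 2). The original program was written in Java; Python is used here only for brevity, and the sample size, seed and function names are hypothetical. The resulting percentages therefore need not match Table 1.

```python
import random


def submasks(mask):
    """Yield every subset of `mask` (as a bitmask), including 0 and mask."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask


def random_probability(n, rng):
    """Random precise probability on n atoms (normalised uniform draws)."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


def lower_envelope(probs, n):
    """Lower probability on all 2^n events induced by a list of probabilities."""
    return [min(sum(p[i] for i in range(n) if A >> i & 1) for p in probs)
            for A in range(1 << n)]


def classify(low, tol=1e-9):
    """Return (is_belief_function, is_2_monotone) for a lower probability."""
    size = len(low)
    m = [sum((-1) ** bin(A ^ B).count("1") * low[B] for B in submasks(A))
         for A in range(size)]
    belief = all(x >= -tol for x in m)                     # Proposition 1(c)
    two_monotone = all(
        low[A | B] + low[A & B] >= low[A] + low[B] - tol   # definition
        for A in range(size) for B in range(size))
    return belief, two_monotone


if __name__ == "__main__":
    rng = random.Random(2003)
    samples = 200
    for n in (3, 4):
        results = [classify(lower_envelope(
            [random_probability(n, rng) for _ in range(3)], n))
            for _ in range(samples)]
        n_belief = sum(b for b, _ in results)
        n_two = sum(t for _, t in results)
        print(n, 100.0 * n_belief / samples, 100.0 * n_two / samples)
```

A small tolerance is used in both tests because the envelope is computed in floating point; a precise probability, for example, has masses that are zero only up to rounding on non-atomic events.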
Table 1. Percentage of belief functions and 2-monotone capacities within randomly
generated coherent lower probabilities

Cardinality of $\Omega$    belief functions    2-monotone capacities
 3                  90.58               100
 4                   9.46               13.08
 5                   2.80                2.99
 6                   1.71                1.77
 7                   1.27                1.31
 8                   1.12                1.14
 9                   1.02                1.04
10                   0.863               0.874

Another point from Sect. 3 is that known transformations often make use of the
mass function $m$. This might suggest seeking a transformation based on $m$
in our context too. However the interpretation of $m$ for imprecise probabilities
is unclear (see [15] for a discussion). Also we are aware of no characterization of
imprecise probabilities in terms of $m$ (like those in Proposition 1).^2

Further, we shall be interested in transforming assessments on a generic
(finite) set of events. For instance, in the case of a multi-agent system, agents are
not necessarily interested in exchanging information about the whole $2^\Omega$ but
may focus on a more restricted set $\mathcal{S}_{int}$ of interesting events. Although the
mass function exists also in the partial case [7], it is easy to see that it does not
preserve the properties it has in a complete assignment.

We shall now illustrate another transformation procedure, which extends a
proposal initially presented in [1]. Defining $\mathcal{S}^* = 2^\Omega \setminus \{\emptyset, \Omega\}$, we suppose at
first $\mathcal{S}_{int} = \mathcal{S}^*$: the case $\mathcal{S}_{int} \subset \mathcal{S}^*$ will be considered later. We require the
transformation to meet the consistency principle, which in terms of upper and
lower probabilities imposes for the resulting precise probability $P^*$ that

$\underline{P}(A) \le P^*(A) \le \overline{P}(A)$, $\forall A \in \mathcal{S}^*$    (9)

When $|\Omega| = 2$ (hence $\mathcal{S}^* = \{A, A^c\}$), $P^*$ may be fully determined from
$P^*(A) = \frac{\underline{P}(A) + \overline{P}(A)}{2}$. This seems reasonable, since $P^*$ then reduces the
imprecision of both $\underline{P}$ and $\overline{P}$ by the same amount, and there is no reason for $P^*$ to be
closer to either of $\underline{P}$ or $\overline{P}$. A straightforward generalization for $|\Omega| > 2$ of the
idea of eliminating imprecision in a symmetric way for each event in $\mathcal{S}^*$ leads to
considering $P_m(A) = \frac{\underline{P}(A) + \overline{P}(A)}{2}$, $\forall A \in \mathcal{S}^*$.

In general $P_m$ is not a precise probability, but we may choose a probability
$P^*$ close to it in some way. Obviously there are several approximation choices;
selecting in particular the common least-squares approximation of $P_m$ leads to
the following transformation problem (TP):

$\min \varphi = \sum_{A \in \mathcal{S}^*} (P^*(A) - P_m(A))^2$    (10)

with the constraints (9) and

$P^*(A) = \sum_{\omega_i \subset A} P^*(\omega_i)$; $P^*(\omega_i) \ge 0$, $i = 1, \ldots, N$; $\sum_{i=1}^{N} P^*(\omega_i) = 1$    (11)

The variables in TP are $P^*(\omega_1), \ldots, P^*(\omega_N)$. Since $\varphi$ is convex on $\mathbb{R}^N$ and
the set of constraints $\mathcal{S}_C$ is a (non-empty)^3 polyhedral set, TP is a convex
quadratic programming problem, for which polynomial-time solving algorithms
are known (see e.g. [19], sec. 11.2). TP has some desirable properties, which we
derived using well-known results in calculus and convex programming:

^2 Several necessary conditions for coherence may be found. For instance, rewriting (2)
using (1) leads to condition $\sum_{C \subset A \cup B,\ C \not\subset A,\ C \not\subset B} m(C) \ge 0$, $\forall A, B : A \cap B = \emptyset$, which
is a special case of the 2-monotonicity characterization in Proposition 1(b) and also
implies (putting $A = \omega_i$, $B = \omega_j \ne \omega_i$) $m(A) \ge 0$ if $|A| = 2$.
derived using well-known results in calculus and convex programming: 

(a) Problem TP always returns a unique $P^*$. In particular, TP detects whether
$P_m$ is a precise probability, since in such a case it gives $P^* = P_m$.

(b) It may be useful to solve the linear system which equates to zero the gradient
vector of $\varphi$. In fact, if its (unique) solution is an interior point in $\mathcal{S}_C$ then it
is the required $P^*$; otherwise we get to know that $P^*$ will be equal to either
$\underline{P}(A)$ or $\overline{P}(A)$ for at least one $A \in \mathcal{S}^*$.

(c) If $\overline{P}(A) = 1$, $\underline{P}(A) = 0$, $\forall A \in \mathcal{S}^*$ (vague statement), TP returns the uniform
probability as $P^*$ (that may be seen applying (b)).
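One way to make the linear system in (b) concrete is to handle the normalization constraint in (11) with a Lagrange multiplier. This is a sketch of one possible formulation, not necessarily the one the authors used; the multiplier $\lambda$ and the shorthand $p_j = P^*(\omega_j)$ are notation introduced here.

```latex
\[
\frac{\partial}{\partial p_i}
\Big[ \sum_{A \in \mathcal{S}^*} \Big( \sum_{\omega_j \subset A} p_j - P_m(A) \Big)^{2}
      + \lambda \Big( \sum_{j=1}^{N} p_j - 1 \Big) \Big]
\;=\;
2 \sum_{A \in \mathcal{S}^*,\; A \supset \omega_i}
      \Big( \sum_{\omega_j \subset A} p_j - P_m(A) \Big) + \lambda \;=\; 0,
\qquad i = 1, \ldots, N .
\]
```

In the vague case of (c), $P_m(A) = \frac{1}{2}$ for every $A \in \mathcal{S}^*$. Putting $p_j = \frac{1}{N}$ turns the $i$-th equation into $2 \sum_{A \supset \omega_i} \big( \frac{|A|}{N} - \frac{1}{2} \big) + \lambda = 0$, and the sum depends only on $N$, not on $i$. A single value of $\lambda$ therefore satisfies all $N$ equations, and the uniform probability, which trivially meets (9) in this case, is the solution, in agreement with (c).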

Let us now suppose that $\mathcal{S}_{int} \subset \mathcal{S}^*$, which is the partial assignment case.
Most known transformations cannot be directly applied to partial
assignments, because they are based on quantities which are typically defined on the
whole $2^\Omega$. They might anyway be applied indirectly, extending the coherent lower
probability to $2^\Omega$. As discussed in Sect. 2, the natural extension appears to
be the most appropriate extension both theoretically and computationally. A
transformation requiring a complete assignment could then be applied to $\underline{P}_E$ on
$2^\Omega$, with the same limitations discussed, for each case, in Sect. 3.

The transformation we are proposing may be applied directly to partial
assessments, as far as both lower and upper probability values are assigned for
each $A \in \mathcal{S}_{int}$.

If this condition holds, it suffices to replace $\mathcal{S}^*$ with $\mathcal{S}_{int}$ in TP; otherwise the
natural extension (of either $\underline{P}$ or $\overline{P}$) on $\mathcal{S}'_{int} = \mathcal{S}_{int} \cup \{A : A^c \in \mathcal{S}_{int}\}$ should
be computed before replacing $\mathcal{S}^*$ with $\mathcal{S}'_{int}$. In both cases the transformation
problem still returns a unique coherent precise probability $P^*(A)$, $\forall A$ ($A \in \mathcal{S}_{int}$
or $A \in \mathcal{S}'_{int}$, respectively). Note that $P^*(\omega_1), \ldots, P^*(\omega_N)$ are generally not
uniquely determined in the partial assessment case.

The direct way appears more appealing at a first glance. However, the
choice between direct and indirect way is not straightforward, and this does
not uniquely depend on the specific transformation we are applying here.

On one hand, the direct way avoids or reduces the computations required to
determine the natural extension. Moreover, in the case of information exchange
between agents adopting different uncertainty formalisms, no non-requested
information is introduced. On the other hand, a partial assignment contains some
implicit information which is actually ignored by the direct way, but could
affect transformation results if considered. Therefore ignoring all implications of a
given partial assessment might be expected to originate an unsatisfactory
transformation result. For instance, there may sometimes be a unique coherent
extension to some event(s) in $\mathcal{S}^*$. Should this piece of information be ignored? The
following example illustrates this situation.

^3 Non-emptiness is implied by coherence, which ensures that $\mathcal{M} \ne \emptyset$ in the lower
envelope theorem.

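Example. Given $\Omega = \{\omega_1, \omega_2, \omega_3\}$ and $\mathcal{S}_{int} = \{\omega_1, \omega_1 \cup \omega_2\}$, the assignment
$\underline{P}(\omega_1) = \underline{P}(\omega_1 \cup \omega_2) = a$, $\overline{P}(\omega_1) = \overline{P}(\omega_1 \cup \omega_2) = 1 - a$ ($a \in [0, \frac{1}{2}]$) on $\mathcal{S}_{int}$ is
equivalent to the lower probability assignment $\underline{P}(\omega_1) = \underline{P}(\omega_1 \cup \omega_2) = \underline{P}(\omega_3) =
\underline{P}(\omega_2 \cup \omega_3) = a$ on the set $\{\omega_1, \omega_1 \cup \omega_2, \omega_3, \omega_2 \cup \omega_3\}$, which is easily
seen to be coherent (for instance, using the envelope theorem). Since upper and
lower probabilities are given for every event in $\mathcal{S}_{int}$, we may consider finding
$P^*$ in a direct way. Here $P_m(\omega_1) = P_m(\omega_1 \cup \omega_2) = \frac{1}{2}$ is a coherent probability
on $\mathcal{S}_{int}$ (being the restriction on $\mathcal{S}_{int}$ of a probability on $2^\Omega$ obtained from
$P_m(\omega_1) = P_m(\omega_3) = \frac{1}{2}$, $P_m(\omega_2) = 0$), hence $P^* = P_m$.

However, using (2) to obtain $\underline{P}(\omega_1 \cup \omega_2) \ge \underline{P}(\omega_1) + \underline{P}(\omega_2)$ and since
$\underline{P}(\omega_1 \cup \omega_2) = \underline{P}(\omega_1) = a$, we note that the given $\underline{P}$ has a unique coherent extension
on $\omega_2$, $\underline{P}(\omega_2) = \underline{P}_E(\omega_2) = \underline{P}_U(\omega_2) = 0$. Since $\underline{P}(\omega_2)$ is determined by the
assessment on $\mathcal{S}_{int}$, we consider computing $P^*$ starting from $\mathcal{S}'_{int} = \mathcal{S}_{int} \cup
\{\omega_2\}$. We therefore add $\underline{P}(\omega_2) = 0$ and, to be able to apply the transformation
to the new assignment, $\overline{P}_E(\omega_2) = 1 - 2a$ to the initial assessment (note that
the initial assignment does not entail a unique $\overline{P}(\omega_2)$). The new $P_m$ is no longer
a coherent probability ($P_m(\omega_1) = P_m(\omega_1 \cup \omega_2) = \frac{1}{2}$, $P_m(\omega_2) = \frac{1}{2} - a$, hence
$P_m$ is not additive), and we may compute $P^*$ noting that the global minimum
of $\varphi = (P^*(\omega_1) - \frac{1}{2})^2 + (P^*(\omega_2) - \frac{1}{2} + a)^2 + (P^*(\omega_1) + P^*(\omega_2) - \frac{1}{2})^2$ satisfies
(9), (11) and therefore gives the required $P^*$, which is such that $P^*(\omega_1) = \frac{1+a}{3}$,
$P^*(\omega_2) = \frac{1-2a}{3}$. Summing up, we obtain:

$P^*(\omega_1) = P^*(\omega_1 \cup \omega_2) = \frac{1}{2}$ operating on $\mathcal{S}_{int}$;

$P^*(\omega_1) = \frac{1+a}{3}$, $P^*(\omega_1 \cup \omega_2) = \frac{2-a}{3}$ operating on $\mathcal{S}'_{int}$.

To get an idea of the difference, let $a = 0$. $\underline{P}$ is then vague, and its most
intuitive transformation appears to be the (restriction on $\mathcal{S}_{int}$ of the) uniform
probability $P_{unif}$. However $P^*$ is equal to $P_{unif}$ when working on $\mathcal{S}'_{int}$, not
when using $\mathcal{S}_{int}$. □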
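The two results of the example can be recomputed by solving the stationarity system of the least-squares objective exactly. The sketch below is an illustration under stated assumptions: events are bitmasks over the three atoms, the normalization constraint is handled with a Lagrange multiplier, and the inequality constraints (9) and the non-negativity conditions in (11) are not imposed but can be checked afterwards. The helper names `solve`, `stationary_point` and `example` are hypothetical.

```python
from fractions import Fraction


def solve(aug):
    """Gauss-Jordan elimination on an augmented matrix of Fractions.
    Raises StopIteration if the system is singular."""
    n = len(aug)
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                k = aug[r][col]
                aug[r] = [x - k * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def stationary_point(events, p_m, n):
    """Atom values minimising sum_A (P*(A) - p_m[A])^2 subject only to
    sum_i P*(w_i) = 1; the last unknown is the Lagrange multiplier."""
    rows = []
    for i in range(n):
        row, rhs = [Fraction(0)] * (n + 1), Fraction(0)
        for A, target in zip(events, p_m):
            if A >> i & 1:                 # event A contains atom i
                for j in range(n):
                    if A >> j & 1:
                        row[j] += 2
                rhs += 2 * target
        row[n] = Fraction(1)               # coefficient of the multiplier
        rows.append(row + [rhs])
    rows.append([Fraction(1)] * n + [Fraction(0), Fraction(1)])
    return solve(rows)[:n]


def example(a):
    """Direct result (on S_int) and result after adding w2 (on S'_int)."""
    half = Fraction(1, 2)
    w1, w2 = 0b001, 0b010
    direct = stationary_point([w1, w1 | w2], [half, half], 3)
    extended = stationary_point([w1, w2, w1 | w2],
                                [half, half - a, half], 3)
    return direct, extended


direct, extended = example(Fraction(0))
print(direct, extended)
```

With $a = 0$ the second computation returns the uniform probability on the three atoms, while the first keeps $P^*(\omega_1) = \frac{1}{2}$, which is the difference the example points out.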

Clearly, the example above is not sufficient to infer what implications of a
given assessment should be necessarily considered before running the
transformation. For instance, it is not even simple in general to detect a priori (i.e.
without computing upper and lower extensions) those events, if any, which
allow a unique extension of $\underline{P}$, and this task may be not necessarily simpler than
just computing $\underline{P}_E$ for $A \notin \mathcal{S}_{int}$, $A \in \mathcal{S}^*$.

5 Conclusions 

The contribution of this paper is twofold. On one hand, we have discussed
several existing transformations and obtained new results about their properties and
limitations when applied to the case of coherent imprecise probabilities; this kind
of analysis had not been previously considered in the literature. On the other
hand, we have proposed an alternative transformation, applicable also to partial
assignments, showing that it features some desirable properties: it preserves the
consistency criterion and tends to remove imprecision in a symmetric way, gives
a unique solution, returns the uniform probability from a vague assignment on
$2^\Omega$, and is computable in polynomial time. We then discussed basic problems
concerning the alternative between direct vs. indirect application of transformations
on partial assignments: this question deserves further investigation, as well as
some aspects of the use of uncertainty invariant transformations with imprecise
probabilities.

Abstract. We introduce a set of transformations on the set of all probability
distributions over a finite state space, and show that these transformations are
the only ones that preserve certain elementary probabilistic relationships. This
result provides a new perspective on a variety of probabilistic inference problems
in which invariance considerations play a role. Two particular applications we
consider in this paper are the development of an equivariance-based approach to
the problem of measure selection, and a new justification for Haldane’s prior as
the distribution that encodes prior ignorance about the parameter of a multinomial
distribution.


1 Introduction 

Many rationality principles for probabilistic and statistical inference are based on con- 
siderations of indifference and symmetry. An early expression of such a principle is 
Laplace’s principle of insufficient reason: “One regards two events as equally probable 
when one can see no reason that would make one more probable than the other, because, 
even though there is an unequal possibility between them, we know not which way, and 
this uncertainty makes us look on each as if it were as probable as the other” (Laplace,
Collected Works vol. VIII, cited after [3]). Principles of indifference only lead to straight- 
forward rules for probability assessments when the task is to assign probabilities to a 
finite number of different alternatives, none of which is distinguished from the others 
by any information we have. In this case all alternatives will have to be assigned equal
probabilities. Such a formalization of indifference by equiprobability becomes notori- 
ously problematic when from state spaces of finitely many alternatives we turn to infinite 
state spaces: on countably infinite sets no uniform probability distributions exist, and on 
uncountably infinite sets the concept of uniformity becomes ambiguous (as evidenced 
by the famous Bertrand’s paradox [6,19]). 

On (uncountably) infinite state spaces concepts of uniformity or indifference have to 
be formalized on the basis of certain transformations of the state space: two sets of states 
are to be considered equiprobable, if one can be transformed into the other using some 
natural transformation t. This, of course, raises the thorny question of which transformations
are to be considered natural and probability-preserving. However, for a given state
space, and a given class of probabilistic inference tasks, it is often possible to identify
natural transformations such that the solution to the inference tasks (which, in particular, can
be probability assessments) should be invariant under the transformations. The widely
accepted resolution of Bertrand’s paradox, for example, is based on such considerations 
of invariance under certain transformations. 
In this paper we are concerned with probabilistic inference problems that pertain to
probability distributions on finite state spaces, which are by far the most widely used type
of distributions in probabilistic modelling in artificial intelligence. As indicated
above, when dealing with finite state spaces there does not seem to be any problem
in capturing indifference principles with equiprobability. However, even though the
underlying space of alternatives may be finite, the object of our study very often is
the infinite set of probability distributions on that space, i.e. for the state space
$S = \{s_1, \ldots, s_n\}$ the $(n-1)$-dimensional probability polytope

$$\Delta^n = \{(p_1, \ldots, p_n) \in \mathbb{R}^n \mid p_i \in [0, 1],\ \sum_i p_i = 1\}.$$

The objective of this paper can now be formulated as follows: we investigate what
natural transformations of $\Delta^n$ exist, such that inference problems that pertain to
$\Delta^n$ should be solved in a way that is invariant under these transformations. In Sect. 2
we identify a unique class of transformations that can be regarded as most natural in
that they alone preserve certain relevant relationships between points of $\Delta^n$. In Sects. 3
and 4 we apply this result to the problems of measure selection and choice of Bayesian
priors, respectively.

An extended version of this paper containing the proofs of theorems is available as [9].

2 Representation Theorem 

The nature of the result we present in this section can best be explained by an analogy:
suppose, for the sake of the argument, that the set of probability distributions we are
concerned with is parameterized by the whole Euclidean space $\mathbb{R}^n$, rather than the subset
$\Delta^n$. Suppose, too, that all inputs and outputs for a given type of inference problem
consist of objects (e.g. points, convex subsets, ...) in $\mathbb{R}^n$. In most cases, one would
then probably require of a rational solution to the inference problem that it does not
depend on the choice of the coordinate system; specifically, if all inputs are transformed
by a translation, i.e. by adding some constant offset $r \in \mathbb{R}^n$, then the outputs computed
for the transformed inputs should be just the outputs computed for the original inputs,
also translated by r:

$$sol(i + r) = sol(i) + r, \qquad (1)$$

where i stands for the inputs and sol for the solution of an inference problem. Con- 
dition (1) expresses an equivariance principle: when the problem is transformed in a 
certain way, then so should be its solution (not to be confused with invariance principles 
according to which certain things should be unaffected by a transformation). 

The question we now address is the following: what simple, canonical transformations
of the set $\Delta^n$ exist, so that for inference problems whose inputs and outputs are
objects in $\Delta^n$ one would require an equivariance property analogous to (1)? Intuitively,
we are looking for transformations of $\Delta^n$ that can be seen as merely a change of
coordinate system, and that leave all relevant geometric structures intact. The following
definition collects some key concepts we will use.







M. Jaeger 



Definition 1. A transformation of a set S is any bijective mapping t of S onto itself. We
often write ts rather than t(s). For a probability distribution $p = (p_1, \ldots, p_n) \in \Delta^n$
the set $\{i \in \{1, \ldots, n\} \mid p_i > 0\}$ is called the set of support of p, denoted support(p).
A transformation t of $\Delta^n$ is said to

- preserve cardinalities of support if for all p: |support(p)| = |support(tp)|

- preserve sets of support if for all p: support(p) = support(tp).

A distribution p is called a mixture of p' and p'' if there exists $\lambda \in [0, 1]$ such that
$p = \lambda p' + (1 - \lambda) p''$ (in other words, p is a convex combination of p' and p''). A
transformation t is said to

- preserve mixtures if for all p, p', p'': if p is a mixture of p' and p'', then tp is a
  mixture of tp' and tp''.

The set of support of a distribution $p \in \Delta^n$ can be seen as its most fundamental
feature: it identifies the subset of states that are to be considered as possible at all,
and thus identifies the relevant state space (as opposed to the formal state space S,
which may contain states $s_i$ that are effectively ruled out by p with $p_i = 0$). When the
association of the components of a distribution p with the elements of the state space
$S = \{s_1, \ldots, s_n\}$ is fixed, then p and p' with different sets of support represent completely
incompatible probabilistic models that would not be transformed into one another by
a natural transformation. In this case, therefore, one would require a transformation to
preserve sets of support.

A permutation of $\Delta^n$ is a transformation that maps $(p_1, \ldots, p_n)$ to
$(p_{\pi(1)}, \ldots, p_{\pi(n)})$, where $\pi$ is a permutation of $\{1, \ldots, n\}$. Permutations preserve
cardinalities of support, but not sets of support. Permutations of $\Delta^n$ are transformations
that are required to preserve the semantics of the elements of $\Delta^n$ after a reordering of the
state space S: if S is reordered according to a permutation $\pi$, then p and $\pi p$ are the same
probability distribution on S. Apart from this particular need for permutations, they do not
seem to have any role as meaningful transformations of $\Delta^n$.

That a distribution p is a mixture of p' and p'' is an elementary probabilistic relation
between the three distributions. It expresses the fact that the probabilistic model p
can arise as an approximation to a finer model that would distinguish the two distinct
distributions p' and p'' on S, each of which is appropriate in a separate context. For
instance, p' and p'' might be the distributions on S = {jam, heavy traffic, light traffic}
that represent the travel conditions on weekdays and weekends, respectively. A mixture
of the two then will represent the probabilities of travel conditions when no distinction
is made between the different days of the week.

That a transformation preserves mixtures, thus, is a natural requirement that it does
not destroy elementary probabilistic relationships. Obviously, preservation of mixtures
immediately implies preservation of convexity, i.e. if t preserves mixtures and A is a
convex subset of $\Delta^n$, then tA also is convex.

We now introduce the class of transformations that we will be concerned with in the
rest of this paper. We denote with $\mathbb{R}^+$ the set of positive real numbers.

Definition 2. Let $r = (r_1, \ldots, r_n) \in (\mathbb{R}^+)^n$. Define for $p = (p_1, \ldots, p_n) \in \Delta^n$

$$t_r(p) := (r_1 p_1, \ldots, r_n p_n) \Big/ \sum_{i=1}^{n} r_i p_i.$$

Also let $T_n := \{t_r \mid r \in (\mathbb{R}^+)^n\}$.

Note that we have $t_r = t_{r'}$ if r' is obtained from r by multiplying each component
with a constant a > 0. We can now formulate our main result.

Theorem 1. Let $n \geq 3$ and t be a transformation of $\Delta^n$.

(i) t preserves sets of support and mixtures iff $t \in T_n$.

(ii) t preserves cardinalities of support and mixtures iff $t = t' \circ \pi$ for some permutation
$\pi$ and some $t' \in T_n$.
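
The easy ("if") direction of (i) can be checked directly from Definition 2. The following calculation is only an illustrative sketch and not the proof given in [9]; the shorthand $r \cdot p := \sum_i r_i p_i$ and the symbol $\mu$ are introduced here for convenience. Since every $r_i > 0$, the i-th component $r_i p_i / (r \cdot p)$ of $t_r p$ is positive exactly when $p_i > 0$, so sets of support are preserved. For a mixture $p = \lambda p' + (1 - \lambda) p''$ one obtains

$$t_r p = \frac{\lambda\,(r_1 p'_1, \ldots, r_n p'_n) + (1 - \lambda)\,(r_1 p''_1, \ldots, r_n p''_n)}{\lambda\,(r \cdot p') + (1 - \lambda)\,(r \cdot p'')} = \mu\, t_r p' + (1 - \mu)\, t_r p'', \qquad \mu = \frac{\lambda\,(r \cdot p')}{\lambda\,(r \cdot p') + (1 - \lambda)\,(r \cdot p'')} \in [0, 1].$$

So $t_r p$ is again a mixture of $t_r p'$ and $t_r p''$, although in general with a mixture weight $\mu$ different from $\lambda$.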

The statements (i) and (ii) do not hold for n = 2: $\Delta^2$ is just the interval [0, 1], and
every monotone bijection of [0, 1] satisfies (i) and (ii). A weaker form of this theorem
was already reported in [8]. The proof of the theorem closely follows the proof of the
related representation theorem for collineations in projective geometry. The following
example illustrates how transformations $t \in T_n$ can arise in practice.

Example 1. In a study of commuter habits it is undertaken to estimate the relative use 
of buses, private cars and bicycles as a means of transportation. To this end, a group 
of research assistants is sent out one day to perform a traffic count on a number of 
main roads into the city. They are given count sheets and short written instructions. Two 
different sets of instructions were produced in the preparation phase of the study: the first 
set advised the assistants to make one mark for every bus, car, and bicycle, respectively, 
in the appropriate column of the count sheet. The second (more challenging) set of 
instructions specified to make as many marks as there are actually people travelling in 
(respectively on) the observed vehicles. By accident, some of the assistants were handed 
instructions of the first kind, others those of the second kind. 

Assume that on all roads being watched in the study, the average number of people 
travelling in a bus, car, or on a bicycle is the same, e.g. 10, 1.5, and 1.01, respectively. 
Also assume that the number of vehicles observed on each road is so large, that the 
actually observed numbers are very close to these averages. 

Suppose, now, that we are more interested in the relative frequency of bus, car and
bicycle use, rather than in absolute counts. Suppose, too, that we prefer the numbers that
would have been produced by the use of the second set of instructions. If, then, an assistant
hands in counts that were produced using the first set of instructions, and that show
frequencies $f = (f_1, f_2, f_3) \in \Delta^3$ for the three modes of transportation, then we obtain the
frequencies we really want by applying the transformation $t_r$ with r = (10, 1.5, 1.01).
Conversely, if we prefer the first set of instructions, and are given frequencies generated
by the second, we can transform them using r' = (1/10, 1/1.5, 1/1.01).
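
The conversion between the two kinds of count sheets can be sketched in a few lines of Python. This is a minimal illustration of Definition 2, not code from the paper; the function name `t_r` and the sample frequencies `f` are assumptions made for the example.

```python
from math import isclose

def t_r(r, p):
    """Apply t_r from Definition 2: scale componentwise by r, then renormalise."""
    scaled = [ri * pi for ri, pi in zip(r, p)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Average number of people per bus, car, bicycle, as in Example 1.
r = [10, 1.5, 1.01]
r_inv = [1 / ri for ri in r]

# Hypothetical vehicle frequencies from a count sheet of the first kind.
f = [0.05, 0.80, 0.15]

f_people = t_r(r, f)             # frequencies as the second instructions would give them
f_back = t_r(r_inv, f_people)    # transforming with r' recovers the original frequencies
```

Multiplying r by any positive constant leaves the result unchanged, because the constant cancels in the normalisation, and a zero component of f stays zero.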

This example gives rise to a more general interpretation of transformations in $T_n$ as
analogues in discrete settings to rescalings, or changes of units of measurements, in a
domain of continuous observables.

3 Equivariant Measure Selection

A fundamental probabilistic inference problem is the problem of measure selection: given
some incomplete information about the true distribution p on S, what is the best rational
hypothesis for the precise value of p? This question takes on somewhat different aspects,
depending on whether p is a statistical, observable probability, or a subjective degree
of belief. In the first case, the “true” p describes actual long-run frequencies, which, in
principle, given sufficient time and experimental resources, one could determine exactly.
In the case of subjective probability, the “true” p is a rational belief state that an ideal
intelligent agent would arrive at by properly taking into account all its actual, incomplete
knowledge.

For statistical probabilities the process of measure selection can be seen as a pre- 
diction on the outcome of experiments that, for some reason, one is unable to actually 
conduct. For subjective probabilities measure selection can be seen as an introspective 
process of refining one’s belief state. A first question then is whether the formal rules 
for measure selection should be the same in these two different contexts, and to which 
of the two scenarios our subsequent considerations pertain. 

Following earlier suggestions of a frequentist basis for subjective probability [16, 
1], this author holds that subjective probability is ultimately grounded in empirical 
observation, hence statistical probability [7]. In particular, in [7] the process of subjective 
measure selection is interpreted as a process very similar to statistical measure selection, 
namely a prediction on the outcome of hypothetical experiments (which, however, here 
even unlimited experimental resources may not permit us to carry out in practice). From 
this point of view, then, formal principles of measure selection will have to be the same 
for subjective and statistical probabilities, and our subsequent considerations apply to 
both cases. We note, however, that Paris [12] holds an opposing view, and sees no reason 
why his rationality principles for measure selection, which were developed for subjective 
probability, should also apply to statistical probability. On the other hand, in support of 
our own position, it may be remarked that the measure selection principles Shore and 
Johnson [18] postulate are very similar to those of Paris and Vencovska [15], but they 
were formulated with statistical probabilities in mind. 

There are several ways in which incomplete information about p can be represented.
One common way is to identify incomplete information with some subset A of $\Delta^n$: A is
then regarded as the set of probability distributions p that are to be considered possible
candidates for being the true distribution. Often A is assumed to be a closed and convex
subset of $\Delta^n$. This, in particular, will be the case when the incomplete information is
given by a set of linear constraints on p. In that case, A is the solution set of linear
constraints, i.e. a polytope.

Example 2. (continuation of example 1) One of the research assistants has lost his count 
sheet on his way home. Unwilling to discard the data from the road watched by this 
assistant, the project leader tries to extract some information about the counts that the 
assistant might remember. The assistant is able to say that he observed at least 10 times 
as many cars as buses, and at least 5 times as many cars as buses and bicycles combined. 
The only way to enter the observation from this particular road into the study, however, 
is in the form of accurate relative frequencies of bus, car, and bicycle use. To this end,
the project leader has to make a best guess of the actual frequencies based on the linear
constraints given to him by the assistant.

Common formulations of the measure selection problem now are: define a selection
function sel that maps closed and convex subsets A of $\Delta^n$ (or, alternatively: polytopes
in $\Delta^n$; or: sets of linear constraints on p) to distributions $sel(A) \in A$.

The most widely favored solution to the measure selection problem is the entropy
maximization rule: define $sel_{me}(A)$ to be the distribution p in A that has maximal entropy
(for closed and convex A this is well-defined). Axiomatic justifications for this selection
rule are given in [18,15]. Both these works postulate a number of formal principles that
a selection rule should obey, and then proceed to show that entropy maximization is the
only rule satisfying all the principles. Paris [13] argues that all these principles in essence
are just expressions of one more general underlying principle, which is expressed by
an informal statement (or slogan) by van Fraassen [19]: Essentially similar problems
should have essentially similar solutions.

In spite of its sound mathematical derivation, entropy maximization does exhibit
some behaviors that appear counterintuitive to many (see [8] for two illustrative
examples). Often this counterintuitive behavior is due to the fact that the maximum entropy
rule has a strong bias towards the uniform distribution $u = (1/n, \ldots, 1/n)$. As u is
the element in $\Delta^n$ with globally maximal entropy, u will be selected whenever $u \in A$.
Consider, for example, Fig. 1(i) and (ii). Shown are two different subsets A and A' of
$\Delta^3$. Both contain u, and therefore $sel_{me}(A) = sel_{me}(A') = u$. While none of Paris’
rationality principles explicitly demands that u should be selected whenever possible,
there is one principle that directly implies the following for the sets depicted in Fig. 1:
assuming that sel(A) = u, and realizing that A' is a subset of A, one should also have
sel(A') = u. This is an instance of what Paris [14] calls the obstinacy principle: for
any A, A' with $A' \subseteq A$ and $sel(A) \in A'$ it is required that sel(A') = sel(A). The
intuitive justification for this is that additional information (i.e. information that
limits the previously considered set of distributions A to A') that is consistent with the previous
default selection (i.e. $sel(A) \in A'$) should not lead us to revise this default selection.
While quite convincing from a default reasoning perspective (in fact, it is a version of
Gabbay’s [2] restricted monotonicity principle), it is not entirely clear that this principle
is an expression of the van Fraassen slogan. Indeed, at least from a geometric point of
view, there seems to be little similarity between the two problems given by A and
A', and thus the requirement that they should have similar solutions (or even the same
solution) hardly seems a necessary consequence of the van Fraassen slogan.

An alternative selection rule that avoids some of the shortcomings of $sel_{me}$ is the
center of mass selection rule $sel_{cm}$: $sel_{cm}(A)$ is defined as the center of mass of A. With
$sel_{cm}$ one avoids the bias towards u, and, more generally, the bias of $sel_{me}$ towards points
on the boundary of the input set A is reversed towards an exclusive preference for points
in the interior of A. A great part of the intuitive appeal of $sel_{cm}$ is probably owed to the
fact that it satisfies (1), i.e. it is translation-equivariant.

Arguing that translations are not the right transformations to consider for $\Delta^n$,
however, we would prefer selection rules that are $T_n$-equivariant, i.e. for all A for which sel
is to be defined, and all $t_r \in T_n$:

$$sel(t_r A) = t_r\, sel(A). \qquad (2)$$

Fig. 1. Maximum Entropy and $T_n$-equivariant selection

This, we would claim, is the pertinent (and succinct) formalization of the van Fraassen
slogan for the measure selection problem. In fact, van Fraassen [19], after giving the
informal slogan, proceeds to explain it further as a general equivariance principle of
the form (1) and (2). The question, thus, is not so much whether this slogan is best
captured as an equivariance requirement, but with which class of transformations the
equivariance principle is to be instantiated. Interpreting theorem 1 as an identification
of the transformations in $T_n$ as the most “similarity preserving” transformations of $\Delta^n$,
we arrive at our answer that $T_n$-equivariance is the principle we require.
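
The difference between the two equivariance requirements can be made concrete with a small numerical experiment. The sketch below is an illustration under a simplifying assumption: the center of mass of a convex set is replaced by the plain average of two hypothetical points, which is enough to show that averaging commutes with translations as in (1) but not with a transformation $t_r$ as in (2). All names and numbers are chosen for the example.

```python
from math import isclose

def t_r(r, p):
    """t_r from Definition 2: componentwise scaling followed by renormalisation."""
    scaled = [ri * pi for ri, pi in zip(r, p)]
    total = sum(scaled)
    return [s / total for s in scaled]

def translate(offset, p):
    """Translation by a fixed offset vector."""
    return [pi + oi for pi, oi in zip(p, offset)]

def mean(points):
    """Componentwise average; a stand-in for the center of mass of a set."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

points = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]   # hypothetical elements of the 3-state polytope
offset = [0.1, -0.05, -0.05]                  # sums to zero, so coordinates still sum to one
r = [10, 1.5, 1.01]

# Property (1): averaging commutes with translations.
mean_of_translated = mean([translate(offset, p) for p in points])
translated_mean = translate(offset, mean(points))

# Property (2) fails for plain averaging: t_r is not an affine map.
mean_of_images = mean([t_r(r, p) for p in points])
image_of_mean = t_r(r, mean(points))
gap = max(abs(a - b) for a, b in zip(mean_of_images, image_of_mean))
```

Here `gap` is clearly nonzero, which suggests why a center-of-mass rule has to be modified before it can satisfy (2).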

Figure 1(iii)-(v) illustrates the $T_n$-equivariance principle: shown are three different
transformations $A_1, A_2, A_3$ of a polytope defined by three linear constraints, and the
corresponding transformations $p_1, p_2, p_3$ of one distinguished element inside the $A_i$.
$T_n$-equivariance now demands that $sel(A_1) = p_1 \Leftrightarrow sel(A_2) = p_2 \Leftrightarrow sel(A_3) = p_3$.

Example 3. (continuation of example 2) Assume that the unlucky assistant in example 2
was given instructions of the first type, and that he collected his data accordingly. If,
instead, he had been given instructions of the second type, then the frequencies on the lost
count sheet would have been frequencies $f' = t_r f$, where f are the actual frequencies
on the lost sheet, and $t_r$ is as in example 1. The partial information he would then have
been able to give also would have taken a different form. For instance, he might then
have stated that he observed at least 6 times as many cars as buses, and at least 4.5 times
as many cars as buses and bicycles combined.

One can show [8] that under very natural modelling assumptions, there corresponds
to the transformation $t_r$ on $\Delta^n$ a dual transformation $\bar{t}_r$ on the space of linear constraints,
such that stating a constraint c for p corresponds to stating the constraint $\bar{t}_r c$ for $t_r p$.
The crucial assumption is consistency preservation, which, in our example, means that
a constraint c the research assistant will state when the frequencies on the lost count
sheet are f is consistent for f (i.e. satisfied by f) iff the constraint c' he would give for
frequencies f' is consistent for f'. The transformation $\bar{t}_r$ can also be characterized by




A Representation Theorem and Applications 






the condition: for all sets of constraints c

$$Sol(\bar{t}_r c) = t_r\, Sol(c),$$

where Sol denotes the solution set.
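
For constraints written in the homogeneous form $a \cdot p \geq 0$, one map that satisfies this condition is $a_i \mapsto a_i / r_i$, because $(a/r) \cdot (t_r p) = (a \cdot p)/(r \cdot p)$ and $r \cdot p > 0$. The sketch below checks this numerically for the two constraints of example 2. It is an illustration under that assumption only: the function names, the coefficient encoding and the candidate frequencies are hypothetical, and no claim is made that this is the exact construction used in [8].

```python
def t_r(r, p):
    """t_r from Definition 2: componentwise scaling followed by renormalisation."""
    scaled = [ri * pi for ri, pi in zip(r, p)]
    total = sum(scaled)
    return [s / total for s in scaled]

def dual(r, a):
    """Map the constraint a.p >= 0 to the constraint (a/r).p' >= 0."""
    return [ai / ri for ai, ri in zip(a, r)]

def satisfies(a, p, tol=1e-12):
    """True if p satisfies the homogeneous linear constraint a.p >= 0."""
    return sum(ai * pi for ai, pi in zip(a, p)) >= -tol

r = [10, 1.5, 1.01]

# Constraints of example 2, component order (bus, car, bicycle):
#   cars >= 10 * buses               ->  -10*f1 + f2         >= 0
#   cars >= 5 * (buses + bicycles)   ->   -5*f1 + f2 - 5*f3  >= 0
constraints = [[-10, 1, 0], [-5, 1, -5]]

# Hypothetical candidate frequencies, some satisfying the constraints and some not.
candidates = [[0.05, 0.85, 0.10], [0.10, 0.80, 0.10], [0.02, 0.70, 0.28]]

# f satisfies c exactly when t_r f satisfies the transformed constraint.
agreement = all(
    satisfies(a, f) == satisfies(dual(r, a), t_r(r, f))
    for f in candidates
    for a in constraints
)
```

A general linear constraint $a \cdot p \geq b$ on the probability polytope can be brought into this homogeneous form by replacing each $a_i$ with $a_i - b$, since the components of p sum to one.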

When the project leader uses a $T_n$-equivariant selection rule for reconstructing the
true frequencies from the information he is given, then the following two approaches will
lead to the same solution, whatever set of instructions this particular assistant was using:
1: first infer the actual frequencies observed by the assistant by applying the selection
rule to the given constraints, and then transform to the preferred type of frequencies.
2: first transform the given constraints so as to have them refer to the preferred type of
frequencies (knowing that this should be done by applying the transformation $\bar{t}_r$), and
then apply the selection rule.

$T_n$-equivariance imposes no restriction on what $sel(A_i)$ should be for any single $A_i$
in Fig. 1. It only determines how the selections for the different $A_i$ should be related.
It thus is far from providing a unique selection rule, like the rationality principles of
Paris and Vencovska [15]. On the other hand, we have not yet shown that $T_n$-equivariant
selection rules even exist. In the remainder of this section we investigate the feasibility
of defining $T_n$-equivariant selection rules, without making any attempts to find the best
or most rational ones.

From (2) one immediately derives a limitation of possible $T_n$-equivariant selection
rules: let $A = \Delta^n$ in (2). Then $t_r A = A$ for every $t_r \in T_n$, and equivariance demands
that $t_r\, sel(A) = sel(A)$ for all $t_r$, i.e. sel(A) has to be a fixpoint under all
transformations. The only elements of $\Delta^n$ that have this property are the n vertices $v_1, \ldots, v_n$,
where $v_i$ is the distribution that assigns unit probability to $s_i \in S$. Clearly a rule with
$sel(\Delta^n) = v_i$ for any particular i would be completely arbitrary, and could not be
argued to follow any rationality principles (more technically, such a rule would not be
permutation equivariant, which is another equivariance property one would demand in
order to deal appropriately with reorderings of the state space, as discussed in Sect. 2).
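
That the vertices are the only common fixpoints can be verified by a short calculation. This is a sketch added for illustration; the shorthand $r \cdot p := \sum_j r_j p_j$ is introduced here and is not notation from the paper. From Definition 2,

$$t_r p = p \iff r_i p_i = p_i\,(r \cdot p) \text{ for all } i \iff r_i = r \cdot p \text{ for all } i \in support(p).$$

If support(p) contains two different indices i and j, then any r with $r_i \neq r_j$ violates the last condition, so p can be fixed by all $t_r$ only if support(p) is a singleton, i.e. $p = v_i$ for some i. Conversely, $t_r v_i = r_i v_i / r_i = v_i$ for every r.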

Similar problems arise whenever sel is to be applied to some $A \subseteq \Delta^n$ that is invariant
under some transformations of $T_n$. To evade these difficulties, we focus in the following
on sets that are not fixpoints under any transformations $t_r$ (this restriction can be lifted
by allowing selection rules that may also return subsets of A, rather than unique points
in A). Let $\mathcal{A}$ denote the class of all $A \subseteq \Delta^n$ with $t_r A \neq A$ for all $t_r \in T_n$ other than
the identity. One can show that $\mathcal{A}$ contains (among many others) all closed sets A that lie
in the interior of $\Delta^n$, i.e. $support(p) = \{1, \ldots, n\}$ for all $p \in A$. In the following example
a $T_n$-equivariant selection rule is constructed for all convex $A \in \mathcal{A}$. This particular rule
may not be a serious candidate for a best or most rational equivariant selection rule. However,
it does have some intuitive appeal, and the method by which it is constructed illustrates a general
strategy by which $T_n$-equivariant selection rules can be constructed.

Example 4. Let $\mathcal{A}^c$ denote the set of all convex $A \in \mathcal{A}$. On $\mathcal{A}^c$ an equivalence relation
$\sim$ is defined by

$$A \sim A' \;:\Leftrightarrow\; \exists\, t_r \in T_n : A' = t_r A.$$

The equivalence class $orb(A) := \{A' \mid A' \sim A\}$ $(= \{t_r A \mid t_r \in T_n\})$ is called
the orbit of A (these are standard definitions). It is easy to verify that for $A \in \mathcal{A}$ also
$orb(A) \subseteq \mathcal{A}$, and that for every $A' \in orb(A)$ there is a unique $t_r \in T_n$ with $A' = t_r A$
(here transformations are unique, but as observed above, this does not imply that the
parameter r representing the transformation is unique).

Suppose that $sel(A) = p = (p_1, \ldots, p_n)$. With $r = (1/p_1, \ldots, 1/p_n)$ then $t_r p = u$,
and by equivariance $sel(t_r A) = u$. It follows that in every orbit there must be some
set A' with sel(A') = u. On the other hand, if sel(A') = u, then this uniquely defines
sel(A) for all A in the orbit of A': sel(A) = p, where $p = t_r u$ with $t_r$ the unique
transformation with $t_r A' = A$. One thus sees that the definition of an equivariant
selection rule is equivalent to choosing for each orbit in $\mathcal{A}^c$ a representative A' for
which sel(A') = u shall hold.
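
The two steps of this argument, moving a selected point to u and recovering it again from u, can be checked numerically. The snippet is a minimal sketch; the point `p` is a hypothetical selection with full support, and `t_r` is an illustrative implementation of Definition 2.

```python
from math import isclose

def t_r(r, p):
    """t_r from Definition 2: componentwise scaling followed by renormalisation."""
    scaled = [ri * pi for ri, pi in zip(r, p)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical selected point sel(A); all components must be strictly positive.
p = [0.2, 0.5, 0.3]
n = len(p)

# With r = (1/p_1, ..., 1/p_n) the point p is mapped to the uniform distribution u.
r = [1 / pi for pi in p]
u = t_r(r, p)

# The inverse transformation has parameter p itself and maps u back to p.
p_again = t_r(p, u)
```

The inverse direction works because $t_p u$ has components $p_i / n$ divided by $\sum_j p_j / n = 1/n$.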

One can show that for each $A \in \mathcal{A}^c$ there exists exactly one $A' \in orb(A)$ for which
u is the center of mass of A'. Combining the intuitive center-of-mass selection rule with
the principle of $T_n$-equivariance, we thus arrive at the $T_n$-equivariant center-of-mass
selection rule: $sel_{equiv\text{-}cm}(A) = p$ iff $A = t_r A'$, u is the center of mass of A', and
$p = t_r u$.



4 Noninformative Priors 



Bayesian statistical inference requires that a prior probability distribution is specified 
on the set of parameters that determines a particular probability model. Herein lies the 
advantage of Bayesian methods, because this prior can encode domain knowledge that 
one has obtained before any data was observed. Often, however, one would like to 
choose a prior distribution that represents the absence of any knowledge: an ignorant or 
noninformative prior. The set $\Delta^n$ is the parameter set for the multinomial probability
model (assuming some sample size N to be given). The question of what distribution
on $\Delta^n$ represents a state of ignorance about this model has received much attention, but
no conclusive answer seems to exist.

Three possible solutions that most often are considered are: the uniform distribution,
i.e. the distribution that has a constant density c with respect to Lebesgue measure,
Jeffreys’ prior, which is given by the density $c \prod_i p_i^{-1/2}$ (where c is a suitable normalizing
constant), and Haldane’s prior, given by density $\prod_i p_i^{-1}$. Haldane’s prior (so named
because it seems to have first been suggested in [4]) is an improper prior, i.e. it has an
infinite integral over $\Delta^n$. All three distributions are Dirichlet distributions with
parameters (1, ..., 1), (1/2, ..., 1/2), and (0, ..., 0), respectively (in the case of Haldane’s
distribution, the usual definition of a Dirichlet distribution has to be extended so as to
allow the parameters (0, ..., 0)). Schafer [17] considers all Dirichlet distributions with
parameters (a, ..., a) for 0 < a < 1 as possible candidates for a noninformative prior.

The justifications for identifying any particular distribution as the appropriate non- 
informative prior are typically based on invariance arguments: generally speaking, ig- 
norance is argued to be invariant under certain problem transformations, and so the 
noninformative prior should be invariant under such problem transformations. There are 
different types of problem transformations one can consider, each leading to a differ- 
ent concept of invariance, and often leading to different results as to what constitutes a 
noninformative prior (see [5] for a systematic overview). In particular, there exist strong 
invariance-based arguments both for Jeffreys’ prior [11], and for Haldane’s prior [10,
20]. In the following, we present additional arguments in support of Haldane’s prior. 

Example 5. (continuation of example 3) Assume that the true, long-term relative
frequencies of bus, car, and bicycle use are the same on all roads at which the traffic count
is conducted (under both counting methods). Then the counts obtained in the study are
multinomial samples determined by a parameter $f_1^* \in \Delta^3$ if the first set of instructions
is used, and $f_2^* \in \Delta^3$ if the second set of instructions is used. Suppose the project leader,
before seeing any counts, feels completely unable to make any predictions on the results
of the counts, i.e. he is completely ignorant about the parameters $f_i^*$.

When the samples are large (i.e. a great number of vehicles are observed on every
road), then the observed frequencies $f_i$ obtained using instructions of type i are expected
to be very close to the true parameter $f_i^*$. The prior probability Pr assigned to a subset
$A \subseteq \Delta^n$ then can be identified with a prior expectation of finding in the actual counts
relative frequencies $f \in A$. If this prior expectation is to express complete ignorance,
then it must be the same for both sampling methods: being told by the first assistant
returning with his counts that he had been using instructions of type 2 will have no
influence on the project leader’s expectations regarding the frequencies on this assistant’s
count sheet. In particular, merely seeing the counts handed in by this assistant will give
the project leader no clue as to which instructions were used by this assistant.

The parameters $f_i^*$ are related by $f_2^* = t_r f_1^*$, where $t_r$ is as in example 1. Having the
same prior belief about $f_2^*$ as about $f_1^*$ means that for every $A \subseteq \Delta^3$ one has
$Pr(A) = Pr(t_r A)$. A noninformative prior, thus, should be invariant under the transformation $t_r$.
As the relation between $f_1^*$ and $f_2^*$ might also be given by some other transformation in
$T_n$, this invariance should actually hold for all these transformations.

This example shows that invariance under $T_n$-transformations is a natural
requirement for a noninformative prior. The next theorem states that this invariance property
only holds for Haldane’s prior. In the formulation of the theorem a little care has to be
taken in dealing with the boundary of $\Delta^n$, where the density of Haldane’s prior is not
defined. We therefore restrict the statement of the theorem to the prior on the interior of
$\Delta^n$, denoted $\mathrm{int}\,\Delta^n$.

Theorem 2. Let Pr be a measure on $\mathrm{int}\,\Delta^n$ with $Pr(\mathrm{int}\,\Delta^n) > 0$ and $Pr(A) < \infty$ for all
compact subsets A of $\mathrm{int}\,\Delta^n$. Pr is invariant under all transformations $t_r \in T_n$ iff Pr
has a density with respect to Lebesgue measure of the form $c \prod_i p_i^{-1}$ with some constant
c > 0.

It is instructive to compare the justification given to Haldane’s prior by this theorem
with the justification given by Jaynes [10]. Jaynes gives an intuitive interpretation of a
noninformative prior as a distribution of beliefs about the true value of p that one would
find in “a population in a state of total confusion”: an individual i in the population
believes the true value of p to be $p^i \in \Delta^n$. The mixture of beliefs one finds in a
population whose individuals base their beliefs on “different and conflicting information”
corresponds to a noninformative prior on $\Delta^n$. Supposing, now, that to all members of this
population a new piece of evidence is given, and each individual changes its belief about
p by conditioning on this new evidence, then a new distribution of beliefs is obtained.
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By a suitable formalization of this scenario, Jaynes shows that a single individual’s 
transition from an original belief 9 to the new belief 9' is given by 0' = a9/ {1 — 9 + a9) 
(Jaynes only considers the binary case, where 9 G [0, 1] takes the role of our p G Z\”). 
This can easily be seen as a transformation from our group T 2 . Jaynes’ argument now 
is that a collective state of total confusion will remain to be one of total confusion even 
after the new evidence has been assimilated by everyone, and so the belief distribution 
about 9 in the population must be invariant under the transformation 9 i-G- 9'. 
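
The remark that Jaynes' update is a member of $T_2$ can be checked numerically. The sketch below assumes that the transformations in $T_n$ have the form $t_r(p)_i = r_i p_i / \sum_j r_j p_j$ for positive weights $r$; this form is an assumption made here for illustration (the definition from example 1 is not repeated in this section), and the function names are hypothetical.

```python
import math

def t_r(r, p):
    """Assumed form of a T_n transformation: reweight p by positive r, then renormalise."""
    weighted = [ri * pi for ri, pi in zip(r, p)]
    total = sum(weighted)
    return [w / total for w in weighted]

def jaynes_update(a, theta):
    """Jaynes' belief transition in the binary case."""
    return a * theta / (1 - theta + a * theta)

a, theta = 2.5, 0.3
# Binary case: p = (theta, 1 - theta) and r = (a, 1).
image = t_r([a, 1.0], [theta, 1.0 - theta])
assert math.isclose(image[0], jaynes_update(a, theta))

# Under the assumed form, composing two transformations multiplies the weights,
# so the transformations are closed under composition.
r, s, p = [2.0, 0.5, 1.0], [0.2, 3.0, 4.0], [0.1, 0.6, 0.3]
lhs = t_r(r, t_r(s, p))
rhs = t_r([ri * si for ri, si in zip(r, s)], p)
assert all(math.isclose(x, y) for x, y in zip(lhs, rhs))
```

Under this assumption the binary case reduces to a single free parameter $a = r_1 / r_2$, which is exactly the parameter in Jaynes' formula.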

This justification, thus, derives a transformation of $\Delta^n$ in a concrete scenario in
which it seems intuitively reasonable to argue that a noninformative prior should be
invariant under these transformations. This is similar to our argument for the invariance
of a noninformative prior under the transformation $t_r$ in example 5. Justifications of
Haldane’s (or any other) prior that are based on such specific scenarios, however, always
leave the possibility open that similarly intuitive scenarios can be constructed which
lead to other types of transformations, and hence to invariance-based justifications for
other priors as noninformative. Theorems 1 and 2 together provide a perhaps more robust
justification of Haldane’s prior: any justification for a different prior which is based on
invariance arguments under transformations of $\Delta^n$ must use transformations that do not
have the conservation properties of definition 1, and therefore will tend to be less natural
than the transformations on which the justification of Haldane’s prior is based.



5 Conclusions 



Many probabilistic inference problems that are characterized by a lack of information 
have to be solved on the basis of considerations of symmetries and invariances. These 
symmetries and invariances, in turn, can be defined in terms of transformations of the 
mathematical objects one encounters in the given type of inference problem. 

The representation theorem we have derived provides a strong argument that in
inference problems whose objects are elements and subsets of $\Delta^n$, one should pay
particular attention to invariances (and equivariances) under the transformations $T_n$.
These transformations can be seen as the analogue in the space $\Delta^n$ of translations in the
space $\mathbb{R}^n$.

One should be particularly aware of the fact that it usually does not make sense
to simply restrict symmetry and invariance concepts that are appropriate in the space
$\mathbb{R}^n$ to the subset $\Delta^n$. A case in point is the problem of noninformative priors. In $\mathbb{R}^n$
Lebesgue measure is the canonical choice for an (improper) noninformative prior, be-
cause its invariance under translations makes it the unique (up to a constant) “uniform”
distribution. Restricted to $\Delta^n$, however, this distinction of Lebesgue measure does not
carry much weight, as translations are not a meaningful transformation of $\Delta^n$. Our re-
sults indicate that the choice of Haldane’s prior for $\Delta^n$ is much more in line with the
choice of Lebesgue measure on $\mathbb{R}^n$ than the choice of the “uniform” distribution, i.e.
Lebesgue measure restricted to $\Delta^n$.

In a similar vein, we have conjectured in Sect. 3 that some of the intuitive appeal
of the center-of-mass selection rule is its equivariance under translations. Again, how-
ever, translations are not the right transformations to consider in this context, and one
therefore should aim to construct $T_n$-equivariant selection rules, as, for example, the
$T_n$-equivariant modification of center-of-mass.

An interesting open question is how many of Paris and Vencovska’s [15] rationality
principles can be reconciled with $T_n$-equivariance. As the combination of all uniquely
identifies maximum entropy selection, there must always be some that are violated by
$T_n$-equivariant selection rules. Clearly the obstinacy principle is rather at odds with
$T_n$-equivariance (though it is not immediately obvious that the two really are inconsistent).
Can one find selection rules that satisfy most (or all) principles except obstinacy?
Abstract. We investigate a simple modal logic of probability with a
unary modal operator expressing that a proposition is more probable 
than its negation. Such an operator is not closed under conjunction, and 
its modal logic is therefore non-normal. Within this framework we study 
the relation of probability with other modal concepts: belief and action. 



1 Introduction 

Several researchers have investigated modal logics of probability. Some have
added probability measures to possible worlds semantics, most prominently
Fagin, Halpern and colleagues [1]. They use modal operators of knowledge, $K\varphi$
expressing that the agent knows that $\varphi$, and they introduce modal operators of
the kind $w(\varphi) \geq b$ expressing that “according to the agent, formula $\varphi$ holds with
probability at least $b$”.

Others have studied the properties of comparative probability, following
Kraft, Pratt, and Seidenberg, and Segerberg. They use a relation $\varphi > \psi$ (that can
also be viewed as a binary modal construction) expressing “$\varphi$ is more probable
than $\psi$”.

Only few have studied a still more qualitative notion, viz. the modal logic of
constructions of the kind $P\varphi$ expressing that $\varphi$ is more probable than (or at
least as probable as) $\neg\varphi$. Among those are Hamblin [2], Burgess [3], and T. Fine
[4]. Halpern and colleagues have studied the similar notion of likelihood [5,6].
Also related is research on modal logics that allow counting the accessible worlds
[7,8]. One can also read $P\varphi$ as “probability of $\varphi$ is high”, and interpret “high”
as “greater than $b$”, for $1 > b > 0.5$. We then basically get the same account as
for $b = 0.5$.^1

^1 As suggested by one of the reviewers, another option is to interpret $P\varphi$ not as a
two-valued modal proposition but as a many-valued modal proposition, as is done
in [9, Chapter 8] and [10]. There, the truth degree of $P\varphi$ is taken as $Prob(\varphi)$, so the
bigger the probability of $\varphi$, the ‘more true’ the proposition $P\varphi$. One can then
express that $\varphi$ is more probable for the agent than $\neg\varphi$ by a non-classical implication
$P\neg\varphi \to P\varphi$.

Probably one of the reasons for the lack of interest in such approaches is
that the corresponding logical systems are very poor, and do not allow one to obtain
completeness results w.r.t. the underlying probability measures.^2

We here investigate the logic of the modal operator $P$. We start by analyzing
its properties, in particular in what concerns the interplay with the notion of
belief. Contrary to comparative possibility, such an operator is not closed under
conjunction, and therefore its modal logic is non-normal [14].

We then turn to semantics. While probability distributions over sets of acces-
sible worlds are helpful to explain modal constructions such as $P\varphi$, it is known
that they do not allow complete axiomatizations. We here study a semantics
that is closer to the set of properties that we have put forward. Our models are
minimal models in the sense of [14], that are based on neighborhood functions
instead of probability distributions. The logic is a non-normal, monotonic modal
logic.^3

Within this framework our aim is to study the relation of probability with 
other modal concepts such as belief and action. We propose principles for the 
interplay between action, belief, and probability, and formulate successor state 
axioms for both belief and probability. While there is a lot of work on proba- 
bilistic accounts of belief and action, as far as we are aware there is no similar 
work relating modal probability to belief and action. 

2 Preliminaries 

For the time being we do not consider interactions between several agents, and 
therefore we only consider a single agent. 



2.1 Atomic Formulas, Atomic Actions 

We have a set of atomic formulas $Atm = \{p, q, \ldots\}$. Our running example will be
in terms of playing dice; we thus consider atomic formulas $d_1, d_2, \ldots$, respectively
expressing “the dice shows 1”, etc.

We have a set of atomic actions $Act = \{\alpha, \beta, \ldots\}$. In our example we have the
throw action of throwing the dice, and the actions $observe_1$, $observe_2$, ... of the
agent observing that the dice shows 1, etc.

Actions are not necessarily executed by the agent under concern, but may be 
executed by other agents or by nature. (So we might as well speak about events 
instead of actions.) 

We could have considered complex actions, but for the sake of simplicity we 
shall not do so here. 

^2 Note that things are simpler if we do not take probability theory but possibility
theory: As shown in [11,12], Lewis’ operator of comparative possibility [13] provides
a complete axiomatization of qualitative possibility relations.

^3 Hence our semantics is rather far away from probabilities. This might be felt to be
at odds with intuitions, but as a matter of fact what we have done is to exactly
capture all that can be formally said about the property $P$ of being probable.

From these ingredients complex formulas will be built together with modal
operators in the standard way.



2.2 Modal Operators 

We have a standard doxastic modal operator $B$, and the formula $B\varphi$ is read
“the agent believes that $\varphi$”, or “$\varphi$ is true for the agent”. For example $\neg B d_6$
expresses that the agent does not believe that the dice shows “6”. The formula
$B(d_1 \lor d_2 \lor d_3 \lor d_4 \lor d_5 \lor d_6)$ expresses that the agent believes the dice shows
one of 1, 2, 3, 4, 5, or 6.

Moreover we have a modal operator $P$ where $P\varphi$ is read “$\varphi$ is probable for
the agent”. The dual $\neg P\neg\varphi$ expresses that $\varphi$ is not improbable. (This operator
has been considered primitive in some papers in the literature.) For example,
$P(d_1 \lor d_2 \lor d_3 \lor d_4)$ expresses that it is probable for the agent that the dice shows
one of 1, 2, 3, or 4. $\neg P d_6$ expresses that it is improbable for the agent that the
dice shows “6”.

Finally, for every action $\alpha \in Act$ we have a dynamic logic operator $[\alpha]$.
The formula $[\alpha]\varphi$ is read “$\varphi$ holds after every execution of $\alpha$”. For example
$\neg[throw]\neg d_6$ expresses that the dice may show 6 after the throwing action.
$\neg P[throw]d_6$ expresses that this is improbable for the agent. $[throw]P\neg d_6$ ex-
presses that after throwing the dice it is probable for the agent that it did not
fall 6. $[throw][observe_6]B d_6$ expresses that after throwing the dice and observing
that it fell 6 the agent believes that it fell 6.



2.3 Relations Agreeing with a Probability Measure 

$P$ can also be viewed as a relation on formulas. Let $Prob$ be any subjective
probability measure defined on formulas that is associated to the agent. When
it holds that

$P\varphi$ iff $Prob(\varphi) > Prob(\neg\varphi)$

we say that $P$ agrees with $Prob$.
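
A minimal sketch of the agreement condition for the running dice example. It assumes a fair dice as the agent's subjective measure and represents a formula by the set of faces on which it is true; both choices, and the function names, are illustrative assumptions and not part of the formal development.

```python
from fractions import Fraction

FACES = range(1, 7)                          # faces of the dice
PROB = {w: Fraction(1, 6) for w in FACES}    # assumed fair-dice measure

def prob(event):
    """Probability of an event, given as a set of faces."""
    return sum(PROB[w] for w in event)

def probable(event):
    """An event is probable iff it is more probable than its complement."""
    complement = set(FACES) - set(event)
    return prob(event) > prob(complement)

print(probable({1, 2, 3, 4}))   # d1 or d2 or d3 or d4: True
print(probable({6}))            # d6: False
print(probable({1, 2, 3}))      # exactly one half is not enough: False
```

The last call shows why the inequality is strict: an event with probability exactly one half is not more probable than its negation.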



2.4 Hypotheses about Action 

We make some hypotheses about actions and their perception by the agent. They
allow us to simplify the theory.



Public Action Occurrences. We suppose that the agent perceives action 
occurrences completely and correctly. For example whenever a dice is thrown 
the agent is aware of that, and whenever the agent believes a dice is thrown then 
indeed such an action has occurred. (One might imagine that action occurrences 
are publicly announced to all agents.) 
Public Action Laws. We suppose that the agent knows the laws governing
the actions. Hence the agent knows that after throwing a dice the effect always is 
that 1, 2, 3, 4, 5, or 6 show up, and that 1 and 2 cannot show up simultaneously, 
etc. 

Non-informativity. We suppose that all actions are non-informative. Non- 
informative actions are actions which are not observed by the agent beyond 
their mere occurrence. In particular the agent does not observe the outcome of 
nondeterministic actions such as that of throwing a dice. Upon learning that such 
an action has occurred the agent updates his belief state: he computes the new 
belief state from the previous belief state and his knowledge about the action 
laws. Hence the new belief state neither depends on the state of the world before 
the action occurrence, nor on the state of the world after the action occurrence. 

In our example we suppose that the throw action is non-informative: the
agent throws the dice without observing the outcome. If the agent learns that
the action of throwing a dice has been executed then he does not learn which
side shows up.

Clearly, the action observe of observing the outcome of the throw action is 
informative: the new belief state depends on the position of the dice in the real 
world. Other examples of informative actions are that of looking up a phone 
number, testing if a proposition is true, telling whether a proposition is true, 
etc. 

Nevertheless, the agent is not disconnected from the world: he may learn that
some proposition is true (i.e. that some action of observing that some proposition
has some value has occurred). For example, when he learns that it has been
observed that the dice fell 6 (i.e. he learns that the action of observing 6 has
been executed) then he is able to update his belief state accordingly. Indeed,
the $observe_i$ actions are non-informative according to our definition: when the
agent learns that $observe_i$ has occurred then he is able to update his belief state
accordingly, and there is no need for further observation of the world. Other
examples of non-informative actions are that of learning that the phone number
of another agent is $N$, testing that a proposition is true (in the sense of Dynamic
Logic tests), telling that a proposition is true, etc.

3 Axioms for Probability 

In this section we give an axiomatization for $P$.

The inference rule for $P$ is

if $\varphi \to \psi$ then $P\varphi \to P\psi$ (RM_P)

and the axioms are as follows:

$P\top$ (N_P)

$P\varphi \to \neg P\neg\varphi$ (D_P)
These axioms match those that have been put forward in the literature,
e.g. those in [4]. As stated there, it seems that there are no other principles of
probability that could be formulated using $P$.

Clearly such an axiomatization is sound w.r.t. the intended reading:

Theorem 1. Let $Prob$ be any probability measure, and suppose the property $P$
agrees with $Prob$, i.e. $P\varphi$ iff $Prob(\varphi) > Prob(\neg\varphi)$. Then $P$ satisfies (RM_P), (N_P),
(D_P).

Another way of expressing this is that whenever we define $P\varphi$ by $Prob(\varphi) > 0.5$
then $P$ satisfies (RM_P), (N_P), (D_P).^4

Nevertheless, such an axiomatics is not complete w.r.t. probability measures.
This will be illustrated in Section 9.
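
The soundness claim can be spot-checked by brute force on a small space. The sketch below is only an illustration under stated assumptions: four worlds, formulas identified with sets of worlds, random rational measures, and hypothetical helper names. It reads (RM_P) as monotony with respect to set inclusion and also tries a threshold above one half.

```python
from fractions import Fraction
from itertools import combinations
import random

WORLDS = frozenset(range(4))

def events(worlds):
    """All subsets of the set of worlds."""
    items = sorted(worlds)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def random_measure(worlds, rng):
    """A random probability measure with rational values (illustrative only)."""
    weights = {w: rng.randint(1, 100) for w in worlds}
    total = sum(weights.values())
    return {w: Fraction(x, total) for w, x in weights.items()}

def probable(event, measure, threshold=Fraction(1, 2)):
    """An event is probable iff its probability exceeds the threshold."""
    return sum(measure[w] for w in event) > threshold

def satisfies_axioms(measure, threshold=Fraction(1, 2)):
    evs = events(WORLDS)
    # (RM_P): if a implies b and a is probable, then b is probable.
    rm = all(probable(b, measure, threshold) for a in evs for b in evs
             if a <= b and probable(a, measure, threshold))
    # (N_P): the tautology is probable.
    n = probable(WORLDS, measure, threshold)
    # (D_P): an event and its complement are never both probable.
    d = all(not (probable(a, measure, threshold)
                 and probable(WORLDS - a, measure, threshold)) for a in evs)
    return rm and n and d

rng = random.Random(0)
assert all(satisfies_axioms(random_measure(WORLDS, rng)) for _ in range(100))
assert all(satisfies_axioms(random_measure(WORLDS, rng), Fraction(3, 4))
           for _ in range(100))
```

Exact fractions are used instead of floats so that events with probability exactly one half are handled without rounding effects.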

4 Axioms for Belief 

Following [15] we suppose a standard KD45 axiomatics for $B$: we have the infer-
ence rule

if $\varphi \to \psi$ then $B\varphi \to B\psi$ (RM_B)

and the following axioms:

$B\top$ (N_B)

$B\varphi \to \neg B\neg\varphi$ (D_B)

$(B\varphi \land B\psi) \to B(\varphi \land \psi)$ (C_B)

$B\varphi \to BB\varphi$ (4_B)

$\neg B\varphi \to B\neg B\varphi$ (5_B)

Hence the set of beliefs is closed under logical consequences, and we suppose
agents are aware of their beliefs and disbeliefs, i.e. we suppose introspection.

5 Axioms Relating Belief and Probability

What is the relation between $P$ and $B$? According to our reading we should
have that things that are believed are also probable for an agent, i.e. we expect
$B\varphi \to P\varphi$ to hold. The following main axiom will allow us to derive that:

$(B\varphi \land P\psi) \to P(\varphi \land \psi)$ (C-MIX)

Just as for the case of beliefs and disbeliefs, agents are aware of probabilities.
This is expressed by the following two axioms:

$P\varphi \to BP\varphi$ (4-MIX)

$\neg P\varphi \to B\neg P\varphi$ (5-MIX)

Other principles of introspection for $P$ will be derived from them in the sequel.

^4 This can be strengthened: if for some $b > 0.5$, $P\varphi$ is defined as $Prob(\varphi) > b$ then
$P$ satisfies (RM_P), (N_P), (D_P). Note that thus our axioms do not conflict with the
view of $P\varphi$ as “probability of $\varphi$ is high”.
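5.1 Some Provable Formulas

1. if $\vdash \varphi \leftrightarrow \psi$ then $\vdash P\varphi \leftrightarrow P\psi$

This can be derived from (RM_P).

2. $\vdash \neg(P\varphi \land P\neg\varphi)$

This is an equivalent formulation of (D_P).

3. $\vdash \neg P\bot$

By (D_P), $P\bot \to \neg P\top$. Then $\neg P\bot$ by (N_P).

4. $\vdash B\varphi \to P\varphi$

This follows from (N_P) and (C-MIX), putting $\psi = \top$.

5. $\vdash (B\varphi \land \neg P\neg\psi) \to \neg P\neg(\varphi \land \psi)$

From (C-MIX) together with (RM_P) it follows $\vdash (B\varphi \land P\neg(\varphi \land \psi)) \to P\neg\psi$.

6. $\vdash P\varphi \to \neg B\neg\varphi$

This follows from the next formula.

7. $\vdash (P\varphi \land P\psi) \to \neg B\neg(\varphi \land \psi)$

This can be proved as follows: first, (C-MIX) together with (RM_P) entails
$\vdash (P\varphi \land B\neg(\varphi \land \psi)) \to P\neg\psi$. Then with (D_P) we get
$\vdash (P\varphi \land B\neg(\varphi \land \psi)) \to \neg P\psi$, from which the theorem follows by classical logic.

8. $\vdash (B(\varphi \to \psi) \land P\varphi) \to P\psi$

This follows from (C-MIX) together with (RM_P).

9. $\vdash P\varphi \leftrightarrow BP\varphi$

The left-to-right direction follows from (4-MIX). The other direction follows from
(5-MIX) and (D_P).

10. $\vdash \neg P\varphi \leftrightarrow B\neg P\varphi$

The left-to-right direction follows from (5-MIX). The other direction follows from
(4-MIX) and (D_P).

11. $\vdash P\varphi \leftrightarrow PP\varphi$

The left-to-right direction follows from (4-MIX). The other direction follows from
(5-MIX) and (D_P).

12. $\vdash \neg P\varphi \leftrightarrow P\neg P\varphi$

The left-to-right direction follows from (5-MIX) and $\vdash B\varphi \to P\varphi$. The other direc-
tion follows from (4-MIX) and (D_P).

13. $\vdash PB\varphi \to P\varphi$

From $\vdash B\varphi \to P\varphi$ it follows that $PB\varphi \to PP\varphi$. And as we have seen,
$PP\varphi \to P\varphi$.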

5.2 Some Formulas That Cannot Be Proved 

The following formulas will not be valid in our semantics. Non-deducibility will 
follow from soundness. 

1. $P\varphi \to B\varphi$

This would in fact identify $P$ and $B$.

2. $P\varphi \to PB\varphi$

Indeed, given that we expect $P\varphi \land \neg B\varphi$ to be consistent, such a formula
would even lead to inconsistency (due to axioms (5_B) and (C-MIX)).

3. $(P\varphi \land P\psi) \to P(\varphi \land \psi)$

This would clash with the probabilistic intuitions: $Prob(\varphi) > Prob(\neg\varphi)$ and
$Prob(\psi) > Prob(\neg\psi)$ does not imply $Prob(\varphi \land \psi) > Prob(\neg(\varphi \land \psi))$.

4. $(P\varphi \land P(\varphi \to \psi)) \to P\psi$

The reasons are the same as for the preceding formula.
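
Concrete counterexamples for formulas 3 and 4 can be read off the dice example. The sketch assumes a fair dice and identifies a formula with the set of faces on which it holds; the names `phi`, `psi` and `chi` are illustrative.

```python
from fractions import Fraction

FACES = frozenset(range(1, 7))

def prob(event):
    """Probability under a fair dice (an assumption of this sketch)."""
    return Fraction(len(event), len(FACES))

def probable(event):
    """An event is probable iff it is more probable than its complement."""
    return prob(event) > prob(FACES - event)

phi = frozenset({1, 2, 3, 4})
psi = frozenset({3, 4, 5, 6})
# Formula 3: both conjuncts are probable, their conjunction {3, 4} is not.
assert probable(phi) and probable(psi) and not probable(phi & psi)

chi = frozenset({1, 2, 3})
implication = (FACES - phi) | chi      # faces on which "phi implies chi" holds
# Formula 4: phi and the implication are probable, chi is not.
assert probable(phi) and probable(implication) and not probable(chi)
```

Both counterexamples rely only on the fact that two events of probability above one half may overlap in an event of probability at most one half.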

6 Axioms for Action 

We suppose the logic of action is just K. We therefore have the inference rule

if $\varphi \to \psi$ then $[\alpha]\varphi \to [\alpha]\psi$ (RM_α)

and the following axioms:

$[\alpha]\top$ (N_α)

$([\alpha]\varphi \land [\alpha]\psi) \to [\alpha](\varphi \land \psi)$ (C_α)

Hence our logic of action is a simple version of dynamic logic [16].

7 Axioms Relating Belief and Action 

We recall that we have stated in Section 2.4 

— that the agent perceives action occurrences completely and correctly, 

— that he knows the laws governing the actions, and 

— that actions are non-informative, i.e. the agent does not learn about partic- 
ular effects of actions beyond what is stipulated in the action laws. 

As action effects are not observed, when the agent learns that the action of 
throwing a dice has been executed then he does not learn whether it fell 6 or 
not. 

In [17,18] we have argued that under these hypotheses the following axioms 
of “no forgetting” (NF) and “no learning” (NL) are plausible. They express that 
the agent’s new belief state only depends on the previous belief state and the 
action whose occurrence he has learned. 

$(\neg[\alpha]\bot \land [\alpha]B\varphi) \to B[\alpha]\varphi$ (NL_B)

$(\neg B[\alpha]\bot \land B[\alpha]\varphi) \to [\alpha]B\varphi$ (NF_B)

For the “no learning” axiom, we must suppose that the action $\alpha$ is executable
(else from $[\alpha]B\varphi$ we could not deduce anything relevant). Similarly, for the “no
forgetting” axiom we must suppose that the agent does not believe $\alpha$ to be
inexecutable (else from $B[\alpha]\varphi$ we could not deduce anything relevant). When the
agent believes $\alpha$ to be inexecutable and nevertheless learns that it has occurred
then he must revise his beliefs. In [19,18] it has been studied how AGM style
belief revision operations [20] can be integrated. We do not go into details here,
and just note that both solutions can be added in a modular way.
(NF_B) and (NL_B) together are equivalent to

$(\neg[\alpha]\bot \land \neg B[\alpha]\bot) \to ([\alpha]B\varphi \leftrightarrow B[\alpha]\varphi)$

Axioms having this form have been called successor state axioms in cognitive
robotics, and it has been shown that (at least in the case of deterministic actions)
they enable a proof technique called regression [21,22].

8 Axioms Relating Probability and Action 

Suppose that, before the agent learns that a dice has been thrown, it is probable for the
agent that the dice will not fall 6: $P[throw]\neg d_6$. When the agent learns that the
dice-throwing action has been executed (without learning the outcome, cf. our
hypotheses) then it is probable for him that the dice does not show 6. Therefore
the following no-learning axiom for $P$ is plausible for non-informative actions:

$(\neg[\alpha]\bot \land [\alpha]P\varphi) \to P[\alpha]\varphi$ (NL_P)

The other way round, when it is probable for the agent that 6 shows up
after throw then (as we have supposed that he does not observe the outcome
of throwing) it was already probable for the agent that 6 would show up before
learning that the action has been executed. This is expressed by the following
no-forgetting axiom.

$(\neg P[\alpha]\bot \land P[\alpha]\varphi) \to [\alpha]P\varphi$ (NF_P)

Again, both axioms are conditioned by executability of $\alpha$ (respectively belief
of executability of $\alpha$).

9 Semantics 

Actions are interpreted as transition systems: truth of a formula $[\alpha]\varphi$ in a
state (alias possible world) means truth of $\varphi$ in all states possibly resulting from
the execution of $\alpha$.

Truth of the formula $B\varphi$ means truth of $\varphi$ in all worlds that are possible for
the agent.

In what concerns the formula $P\varphi$, the intuition is that to every possible world
there is associated a probability measure over the set of epistemically accessible
worlds, and that $Prob(\varphi) > Prob(\neg\varphi)$. Sometimes the intuition is put forward
that among the set of accessible worlds there are more worlds where $\varphi$ is true
than worlds where $\varphi$ is false. We shall show in the sequel that such an explanation
is misleading.

A frame is a tuple $\langle W, B, P, \{R_\alpha : \alpha \in Act\} \rangle$ such that
- $W$ is a nonempty set of possible worlds

- $B : W \to 2^W$ maps worlds to sets of worlds

- $P : W \to 2^{2^W}$ maps worlds to sets of sets of worlds

- $R_\alpha : W \to 2^W$ maps worlds to sets of worlds, for every $\alpha \in Act$

Thus for every possible world $w \in W$, $B(w)$ and $R_\alpha(w)$ are sets of accessible
worlds as usual.

By convention, for a set of possible worlds $V \subseteq W$ we suppose
$R_\alpha(V) = \bigcup_{v \in V} R_\alpha(v)$, etc.

$P(w)$ is a set of sets of possible worlds. Although intuitively $P$ collects ‘big’
subsets of $B$ (in the sense that for $V \in P(w)$, $V$ contains more elements than its
complement w.r.t. $W$, $W \setminus V$), there is no formal requirement reflecting this.
Every frame must satisfy some constraints: for every $w \in W$,

(d_B) $B(w) \neq \emptyset$

(45_B) if $w' \in B(w)$ then $B(w') = B(w)$

(n_P) $P(w) \neq \emptyset$

(d_P) if $V_1, V_2 \in P(w)$ then $V_1 \cap V_2 \neq \emptyset$

(c-mix) if $V \in P(w)$ then $V \subseteq B(w)$

(45-mix) if $w' \in B(w)$ then $P(w') = P(w)$

(nf-nl_B) if $w' \in R_\alpha(w)$ and $R_\alpha(B(w)) \neq \emptyset$ then $B(w') = R_\alpha(B(w))$.

(nf-nl_P) if $w' \in R_\alpha(w)$ then $P(w') = \{R_\alpha(V) : V \in P(w) \text{ and } R_\alpha(V) \neq \emptyset\}$
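
The static constraints can be checked mechanically on a finite frame. The following sketch is a hypothetical helper, not part of the formal development: it assumes that `B` maps each world to a frozenset of worlds and `P` maps each world to a set of frozensets, and it leaves out the two constraints that involve actions.

```python
def check_static_constraints(W, B, P):
    """Check (d_B), (45_B), (n_P), (d_P), (c-mix) and (45-mix) at every world."""
    for w in W:
        if not B[w]:                                           # (d_B)
            return False
        if any(B[v] != B[w] for v in B[w]):                    # (45_B)
            return False
        if not P[w]:                                           # (n_P)
            return False
        if any(not (V1 & V2) for V1 in P[w] for V2 in P[w]):   # (d_P)
            return False
        if any(not V <= B[w] for V in P[w]):                   # (c-mix)
            return False
        if any(P[v] != P[w] for v in B[w]):                    # (45-mix)
            return False
    return True

W = {0, 1, 2}
B = {w: frozenset(W) for w in W}
good_P = {w: {frozenset({0, 1}), frozenset({1, 2}), frozenset(W)} for w in W}
bad_P = {w: {frozenset({0}), frozenset({1, 2})} for w in W}   # disjoint neighborhoods

print(check_static_constraints(W, B, good_P))   # True
print(check_static_constraints(W, B, bad_P))    # False: (d_P) fails
```

Note that (d_P), applied with $V_1 = V_2$, already rules out the empty set as a neighborhood.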

As usual a model is a frame together with a valuation: $M = \langle \mathcal{F}, \mathcal{V} \rangle$, where
$\mathcal{V} : Atm \to 2^W$ maps every atom to the set of worlds where it is true. To
formulate the truth conditions we use the following abbreviation:

$\|\varphi\|_M = \{w \in W : M, w \models \varphi\}$

Then given a model $M$, the truth conditions are as usual for the operators of
classical logic, plus:

- $M, w \models \varphi$ if $\varphi \in Atm$ and $w \in \mathcal{V}(\varphi)$

- $M, w \models B\varphi$ if $B(w) \subseteq \|\varphi\|_M$

- $M, w \models P\varphi$ if there is $V \in P(w)$ such that $V \subseteq \|\varphi\|_M$

- $M, w \models [\alpha]\varphi$ if $R_\alpha(w) \subseteq \|\varphi\|_M$
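
The truth conditions translate directly into a small evaluator for finite models. The sketch below is illustrative only: the tuple encoding of formulas, the dictionary layout of the model and the function name `extension` are assumptions made here, and only negation and conjunction are included as classical connectives.

```python
from itertools import combinations

def extension(model, formula):
    """Set of worlds of the model where the formula is true."""
    W, B, P, R, val = (model[k] for k in ("W", "B", "P", "R", "val"))
    op = formula[0]
    if op == "atom":
        return frozenset(val[formula[1]])
    if op == "not":
        return frozenset(W) - extension(model, formula[1])
    if op == "and":
        return extension(model, formula[1]) & extension(model, formula[2])
    if op == "B":      # all belief-accessible worlds satisfy the argument
        ext = extension(model, formula[1])
        return frozenset(w for w in W if B[w] <= ext)
    if op == "P":      # some neighborhood is included in the extension
        ext = extension(model, formula[1])
        return frozenset(w for w in W if any(V <= ext for V in P[w]))
    if op == "box":    # ("box", action, argument)
        ext = extension(model, formula[2])
        return frozenset(w for w in W
                         if R[formula[1]].get(w, frozenset()) <= ext)
    raise ValueError(f"unknown operator: {op}")

# Dice model: six worlds, everything is belief-accessible, and the
# neighborhoods are the sets with more than three worlds.
W = frozenset(range(1, 7))
big = {frozenset(c) for k in (4, 5, 6) for c in combinations(sorted(W), k)}
model = {"W": W, "B": {w: W for w in W}, "P": {w: big for w in W},
         "R": {}, "val": {f"d{i}": {i} for i in W}}

not_six = ("not", ("atom", "d6"))
assert extension(model, ("P", not_six)) == W            # probable everywhere
assert extension(model, ("B", not_six)) == frozenset()  # believed nowhere
```

In this model "not 6" is probable but not believed, which is the kind of situation that keeps $P$ and $B$ apart.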



9.1 An Example 

Let us give an example. It will at the same time illustrate that the intuition of
$P(w)$ ‘collecting more than 50% of the accessible worlds’ is misleading.

Let the agent learn in $w_0$ that a dice has been thrown. Then we might
suppose that after throw the situation is described by a possible world $w$ where
$B(w) = \{v_1, \ldots, v_6\}$ such that $v_i \in \mathcal{V}(d_j)$ iff $i = j$, and where $P(w)$ is the
set of all subsets of $B(w)$ containing more than half of the worlds in $B(w)$, i.e.
$P(w) = \{V \subseteq B(w) : card(V) > 3\}$.

Now suppose we are in a game where a player is entitled to throw his dice a
second time if (and only if) his first throw was a 6. Let $throwif6$ describe that
deterministic conditional action. We have thus $R_{throwif6}(v_6) = \{v_{6,1}, \ldots, v_{6,6}\}$
with $v_{6,i} \in \mathcal{V}(d_j)$ iff $i = j$. For $i \leq 5$, we have $R_{throwif6}(v_i) = \{v'_i\}$ with $v'_i \in \mathcal{V}(d_j)$
iff $v_i \in \mathcal{V}(d_j)$. According to our semantics, the situation after a completed turn
can be described by a possible world $w'$ where

- $R_{throwif6}(w) = \{w'\}$

- $B(w') = R_{throwif6}(v_6) \cup \bigcup_{i \leq 5} R_{throwif6}(v_i)$.

- The neighborhood $P(w')$ of $w'$ contains in particular $\{v'_1, v'_2, v'_3, v'_4\}$, although
this set contains much less than half of the worlds in $B(w')$.
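
The example can be replayed by computing the successor belief set and neighborhoods with the constraints (nf-nl_B) and (nf-nl_P). The world names and the helper `image` below are illustrative assumptions of this sketch.

```python
from itertools import combinations

# Worlds after the first throw; P(w) holds the subsets with more than 3 worlds.
B_w = frozenset(f"v{i}" for i in range(1, 7))
P_w = {frozenset(c) for k in range(4, 7) for c in combinations(sorted(B_w), k)}

# throwif6: v6 leads to the six outcomes of a second throw,
# every other v_i leads to a single copy v_i'.
R = {f"v{i}": frozenset({f"v{i}'"}) for i in range(1, 6)}
R["v6"] = frozenset(f"v6_{j}" for j in range(1, 7))

def image(worlds):
    """R(V): the union of R(v) for all v in V."""
    return frozenset().union(*(R[v] for v in worlds))

B_next = image(B_w)                               # (nf-nl_B)
P_next = {image(V) for V in P_w if image(V)}      # (nf-nl_P)

small = frozenset({"v1'", "v2'", "v3'", "v4'"})
print(len(B_next))        # 11
print(small in P_next)    # True
```

The set `small` is the image of the neighborhood $\{v_1, v_2, v_3, v_4\}$; it has 4 of the 11 worlds of the new belief set, so counting worlds no longer matches the neighborhoods.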



9.2 Soundness and Completeness 

Our axiomatization is sound w.r.t. the present neighborhood semantics: 

Theorem 2. If $\varphi$ is provable from our axioms and inference rules, then $\varphi$ is
valid in neighborhood semantics.

We conjecture that we have completeness, too. The only nonstandard part
of the Henkin proof concerns the neighborhood semantics: In principle, for all
$w \in W$ and $V \in P(w)$ our axiom (C-MIX) only enforces that there is some
$V' \in P(w)$ such that $V' \subseteq V \cap B(w)$. What we would like our model to satisfy
is that $V \subseteq B(w)$. In order to guarantee that, frames must be transformed in the
following way:

Lemma 1. Let $\langle \mathcal{F}, \mathcal{V} \rangle$ be any model satisfying all the constraints except (c-
mix). If $\mathcal{F} \models$ (C-MIX) and $\langle \mathcal{F}, \mathcal{V} \rangle, w \models \varphi$ then there is a model $\langle \mathcal{F}', \mathcal{V}' \rangle$ such
that $\langle \mathcal{F}', \mathcal{V}' \rangle$ satisfies the constraints and such that $\langle \mathcal{F}', \mathcal{V}' \rangle, w \models \varphi$.

Proof. We define $W' = W$, $\mathcal{V}' = \mathcal{V}$, $B' = B$, $R'_\alpha = R_\alpha$, and
$P'(w) = \{V \in P(w) : V \subseteq B(w)\}$. As for every $V \in P(w)$ there is some $V' \in P(w)$ such that
$V' \subseteq V \cap B(w)$, $P'(w)$ is nonempty. Moreover, we can prove by induction that for
every $w \in W$ and every formula $\psi$, we have $\langle \mathcal{F}, \mathcal{V} \rangle, w \models \psi$ iff $\langle \mathcal{F}', \mathcal{V}' \rangle, w \models \psi$.



9.3 The Relation with Probability Measures 

In any case, our neighborhood semantics differs from the standard semantics
in terms of probability measures. Our axiomatization is not complete w.r.t. probability
measures, as announced in Section 3.

Theorem 3 ([4]). Let $Atm = \{a, b, c, d, e, f, g\}$. Take a model $M$ where

- $W = 2^{Atm}$

- for every $w \in W$, $N(w) = \{efg, abg, adf, bde, ace, cdg, bcf\}$, where $efg$ is
used to denote $\{e, f, g\}$, etc.

- $\mathcal{V}(p) = \{w \in W : p \in w\}$

Then $M$ satisfies the above constraints on neighborhood frames, but there is no
agreeing probability measure.
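
The counting argument behind this example can be verified directly. The sketch below makes a simplifying assumption that is not in the statement above: it treats $a, \ldots, g$ as the seven points over which a measure is defined, and the seven listed sets as the neighborhoods. The names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

POINTS = "abcdefg"
SETS = [frozenset(s) for s in ("efg", "abg", "adf", "bde", "ace", "cdg", "bcf")]

# (d_P): any two of the seven sets intersect.
assert all(x & y for x, y in combinations(SETS, 2))

# Every point lies in exactly three of the seven sets.
assert all(sum(p in s for s in SETS) == 3 for p in POINTS)

def total_mass(measure):
    """Sum of the probabilities of the seven sets under a given measure."""
    return sum(sum(measure[p] for p in s) for s in SETS)

uniform = {p: Fraction(1, 7) for p in POINTS}
skewed = dict(zip(POINTS, (Fraction(k, 28) for k in range(1, 8))))
assert total_mass(uniform) == 3 and total_mass(skewed) == 3
```

Since each point is counted three times, the seven probabilities sum to 3 for every measure. An agreeing measure would need each of the seven sets to have probability above one half, hence a sum above 3.5, which is impossible.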
10 Conclusion

We have investigated a ‘very qualitative’ notion of probability, that of a formula 
being more probable than its negation. We have presented the axioms governing 
its relation with belief, and we have proposed principles for the interplay between 
action, belief, and probability. 

While there is a lot of work on probabilistic accounts of belief and action, as 
far as we are aware there is no similar work relating modal probability to belief 
and action. 

While we provide a probabilistic account of belief, we do not consider prob-
abilistic action here. Therefore (and just as in the logics of belief and action)
uncertainty can only diminish as actions occur, and in the long run probabili-
ties will converge towards belief, in the sense that we will have $P(w) = \{B(w)\}$.
Just as in the case of shrinking belief states, this is unsatisfactory. In future
work we shall introduce misperception (as already done for beliefs in [18]) and
probabilistic actions in order to improve the account.



Acknowledgements. Thanks to the three reviewers, all of whom have provided
comments that hopefully enabled us to improve our exposition.
Abstract. New standards in document representation, such as
SGML, XML, and MPEG-7, compel Information Retrieval to design and
implement models and tools to index, retrieve and present documents
according to the given document structure. The paper presents the de-
sign of an Information Retrieval system for multimedia structured doc-
uments, such as journal articles, e-books, and MPEG-7 videos.
The system is based on Bayesian Networks, since this class of mathe-
matical models makes it possible to represent and quantify the relations between
the structural components of the document. Some preliminary results on
the system implementation are also presented.



1 Introduction 

Information Retrieval (IR) systems are powerful and effective tools for access- 
ing documents by content. A user specifies the required content using a query, 
often consisting of a natural language expression. Documents estimated to be 
relevant to the user query are presented to the user through an interface. New 
standards in multimedia document representation compel IR to design and im- 
plement models and tools to index, retrieve and present documents according to 
the given document structure. In fact, while standard IR treats documents as if 
they were atomic entities, modern IR needs to be able to deal with more elab- 
orate document representations, like for example documents written in SGML, 
HTML, XML or MPEG-7. These document representation formalisms make it possible
to represent and describe documents said to be structured, that is, documents
whose content is organised around a well-defined structure. Examples of these
documents are books and textbooks, scientific articles, technical manuals, edu- 
cational videos, etc. This means that documents should no longer be considered 
as atomic entities, but as aggregates of interrelated objects that need to be in- 
dexed, retrieved, and presented both as a whole and separately, in relation to 
the user’s needs. In other words, given a query, an IR system must retrieve the 
set of document components that are most relevant to this query, not just entire
documents. 

In order to enable querying both content and structure an IR system needs 
to possess the necessary primitives to model effectively the document’s content 
and structure. Taking into account that Bayesian Networks (BNs) have already
been successfully applied to build standard IR systems, we believe that they
are also an appropriate tool to model, both qualitatively and quantitatively,
the content and structural relations of multimedia structured documents.

In this paper we propose a BN model for structured document retrieval, 
which can be considered as an extension of a previously developed model to 
manage standard (non-structured) documents [1,6]. The rest of the paper is 
organized as follows: we begin in Sect. 2 with the preliminaries. In Sect. 3 we 
introduce the Bayesian network model for structured document retrieval, the 
assumptions that determine the network topology being considered, the details 
about probability distributions stored in the network, and the way in which we 
can efficiently use the network model for retrieval, by performing probabilistic 
inference. Section 4 shows preliminary experimental results obtained with the 
model, using a structured document test collection [9]. Finally, Sect. 5 contains 
the concluding remarks and some proposals for future research. 

2 Preliminaries 

Probabilistic models constitute an important kind of IR models, which have been 
widely used for a long time [5], because they offer a principled way to manage 
the uncertainty that naturally appears in many elements within this field. These 
models (and others, as the Vector Space model [15]) usually represent documents 
and queries by means of vectors of terms or keywords, which try to characterize 
their information content. Because these terms are not equally important, they 
are usually weighted to highlight their importance in the documents they belong 
to, as well as in the whole collection. The most common weighting schemes are 
[15] the term frequency, $tf_{ij}$, i.e., the number of times that the $i$-th term
appears in the $j$-th document, and the inverse document frequency, $idf_i$, of the
term in the collection, typically $idf_i = \log(N/n_i)$, where $N$ is the number of
documents in the collection, and $n_i$ is the number of documents that contain the
term. The combination of both weights, $tf_{ij} \cdot idf_i$, is also a common
weighting scheme.
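
As an illustration of the combined weighting scheme, the following minimal sketch computes $tf \cdot idf$ weights for a toy collection. It assumes the plain form $idf_i = \log(N/n_i)$ (other variants add a constant or change the logarithm base); the function name `tf_idf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return one {term: tf * idf} dictionary per document.

    Assumption: idf_i = log(N / n_i); other variants add a constant
    or use a different logarithm base.
    """
    n_docs = len(docs)
    # n_i: number of documents that contain term i at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / n_i) for term, n_i in doc_freq.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of term i in document j
        weights.append({term: count * idf[term] for term, count in tf.items()})
    return weights


# Toy collection (hypothetical data): each document is a list of terms.
docs = [
    ["bayes", "network", "network"],
    ["network", "retrieval"],
    ["bayes", "query"],
]
weights = tf_idf_weights(docs)
```

A term that is frequent in one document but rare in the collection (here "retrieval") receives a comparatively high weight.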

2.1 Information Retrieval and Bayesian Networks: The Bayesian 
Network Model with Two Layers 

Bayesian networks have also been successfully applied in a variety of ways within 
the IR environment, as an extension/modification of probabilistic IR models [6, 
13,16]. We shall focus on a specific BN-based retrieval model, the Bayesian Net- 
work Retrieval Model with two layers (BNR-2) [1,6], because it will be the start- 
ing point of our proposal to deal with structured documents. 

The set of variables $V$ in the BNR-2 model is composed of two different
sets, $V = \mathcal{T} \cup \mathcal{D}$: the set $\mathcal{T} = \{T_1, \ldots, T_M\}$,
containing binary random variables representing the $M$ terms in the glossary
from a given collection, and the set $\mathcal{D} = \{D_1, \ldots, D_N\}$,
corresponding also to binary random variables, representing the $N$ documents
that compose the collection. We will use the notation $T_i$ ($D_j$, respectively)
to refer to the term (document, respectively) and also to its associated variable
and node. A variable $D_j$ has its domain in the set $\{d_j^-, d_j^+\}$, where
$d_j^-$ and $d_j^+$ respectively mean 'the document $D_j$ is not relevant' and 'the
document $D_j$ is relevant' for a given query$^{1}$. A variable $T_i$ takes its values
from the set $\{t_i^-, t_i^+\}$, where in this case $t_i^-$ stands for 'the term $T_i$
is not relevant', and $t_i^+$ represents 'the term $T_i$ is relevant'$^{2}$. To denote
a generic, unspecified value of a term variable $T_i$ or a document variable $D_j$,
we will use lower-case letters, $t_i$ and $d_j$.

Fig. 1. Two-layered Bayesian network for the BNR-2 model

With respect to the topology of the network (see Fig. 1), there are arcs going 
from term nodes to those document nodes where these terms appear, and there 
are no arcs connecting pairs of either document nodes or term nodes. This means
that terms are marginally independent of each other, and documents are
conditionally independent given the terms that they contain. In this way, we get 
a network composed of two simple layers, the term and document subnetworks, 
with arcs only going from nodes in the first subnetwork to nodes in the second 
one. 

The probability distributions stored in each node of the BNR-2 model are
computed as follows: For each term node we need a marginal probability
distribution, $p(t_i)$; we use $p(t_i^+) = \frac{1}{M}$ and
$p(t_i^-) = \frac{M-1}{M}$ ($M$ being the number of terms in the collection)$^{3}$.
For the document nodes we have to estimate the conditional probability
distribution $p(d_j \mid pa(D_j))$ for any configuration $pa(D_j)$ of $Pa(D_j)$
(i.e., any assignment of values to all the variables in $Pa(D_j)$), where
$Pa(D_j)$ is the parent set of $D_j$ (which coincides with the set of terms indexing
document $D_j$). As a document node may have a high number of parents, the
number of conditional probabilities that we need to estimate and store may be
huge. Therefore, the BNR-2 model uses a specific canonical model to represent
these conditional probabilities:

$$p(d_j^+ \mid pa(D_j)) = \sum_{T_i \in R(pa(D_j))} w(T_i, D_j), \qquad p(d_j^- \mid pa(D_j)) = 1 - p(d_j^+ \mid pa(D_j)), \tag{1}$$

where $R(pa(D_j)) = \{T_i \in Pa(D_j) \mid t_i^+ \in pa(D_j)\}$, i.e., the set of terms in
$Pa(D_j)$ that are instantiated as relevant in the configuration $pa(D_j)$; $w(T_i, D_j)$
are weights verifying $w(T_i, D_j) \geq 0$ and $\sum_{T_i \in Pa(D_j)} w(T_i, D_j) \leq 1$.
So, the more terms are relevant in $pa(D_j)$, the greater the probability of
relevance of $D_j$.

$^{1}$ A document is relevant for a given query if it satisfies the user's information need
expressed by means of this query.

$^{2}$ A term is relevant in the sense that the user believes that this term will appear in
relevant documents.

$^{3}$ Although these probabilities could also be estimated from the dataset, the
uninformed estimate proposed produces better results.

The BNR-2 model can be used to obtain a relevance value for each document
given a query $Q$. Each term $T_i$ in the query $Q$ is considered as evidence for the
propagation process, and its value is fixed to $t_i^+$. Then, the propagation process
is run, thus obtaining the posterior probability of relevance of each document given
that the terms in the query are also relevant, $p(d_j^+ \mid Q)$. Later, the documents
are sorted according to their corresponding probability and shown to the user.
Taking into account the number of nodes in the network ($N + M$) and the fact that,
although its topology seems relatively simple, there are multiple pathways
connecting nodes as well as nodes with a great number of parents, general purpose
inference algorithms cannot be applied due to efficiency considerations, even for
small document collections. So, the BNR-2 model uses a tailored inference process
that computes the required probabilities very efficiently and ensures that
the results are the same as those obtained using exact propagation in the entire
network. The key result is stated in the following proposition [7]:

Proposition 1. Let $D_j$ be a binary variable in a Bayesian network having only
binary variables as its parents. Assume an evidence $Q$ d-separated from $D_j$ by
its parents. If $p(d_j^+ \mid pa(D_j))$ is defined as in eq. (1), then

$$p(d_j^+ \mid Q) = \sum_{T_i \in Pa(D_j)} w(T_i, D_j) \cdot p(t_i^+ \mid Q). \tag{2}$$

Taking into account the topology of the term subnetwork, $p(t_i^+ \mid Q) = 1$ if
$T_i \in Q$ and $p(t_i^+ \mid Q) = \frac{1}{M}$ if $T_i \notin Q$, hence eq. (2) becomes

$$p(d_j^+ \mid Q) = \sum_{T_i \in Pa(D_j) \cap Q} w(T_i, D_j) + \frac{1}{M} \sum_{T_i \in Pa(D_j) \setminus Q} w(T_i, D_j). \tag{3}$$

Observe that eq. (3) also includes those terms in Pa{Dj) that are not in the 
query. This is due to the fact that the user has not established that terms outside 
the query are irrelevant, and therefore they contribute to the relevance of the 
document with their prior probability. 
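
The closed form in eq. (3) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names, the toy index and the weight values are hypothetical, and the weights are simply assumed to satisfy the constraints of the canonical model.

```python
def bnr2_posterior(doc_weights, query_terms, num_terms):
    """Posterior relevance of one document given the query, as in eq. (3).

    doc_weights: {term: w(T_i, D_j)} for the terms indexing the document.
    num_terms:   M, the number of terms in the collection (prior is 1/M).
    """
    prior = 1.0 / num_terms
    in_query = sum(w for t, w in doc_weights.items() if t in query_terms)
    outside = sum(w for t, w in doc_weights.items() if t not in query_terms)
    return in_query + prior * outside


def rank_documents(index, query_terms, num_terms):
    """Sort documents by decreasing posterior probability of relevance."""
    scores = {d: bnr2_posterior(w, query_terms, num_terms) for d, w in index.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index: weights per document are non-negative and sum to 1.
index = {
    "D1": {"bayes": 0.5, "network": 0.3, "query": 0.2},
    "D2": {"retrieval": 0.6, "network": 0.4},
}
ranking = rank_documents(index, {"bayes", "network"}, num_terms=4)
```

Terms outside the query still contribute, but only with their prior probability $1/M$, which is why D2 receives a non-zero share from "retrieval".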
2.2 Structured Document Retrieval

In IR the area of research dealing with structured documents is known as struc- 
tured document retrieval. A good survey of the state of the art of structured 
document retrieval can be found in [4] . The inclusion of the structure of a docu- 
ment in the indexing and retrieval process affects the design and implementation 
of the IR system in many ways. First of all, the indexing process must consider 
the structure in the appropriate way, so that users can search the collection 
both by content and structure. Secondly, the retrieval process should use both 
structure and content in the estimate of the relevance of documents. Finally, 
the interface and the whole interaction has to enable the user to make full use 
of the document structure. In fact, querying by content and structure can only 
be achieved if the user can specify in the query what he/she is looking for, and 
where this should be located in the required documents. The “what” involves 
the specification of the content, while the “where” is related to the structure of 
the documents. 

It has been recognised that the best approach to querying structured docu- 
ments is to let the user specify in the most natural way both the content and 
the structural requirements of the desired documents [4]. This can be achieved 
by letting the user specify the content requirement in a natural language query, 
while enabling the user to qualify the structural requirements through a graphical 
user interface. A GUI is well suited to show and let the user indicate structural 
elements of documents in the collection [17]. 

This paper addresses the issues related to the modelling of the retrieval of
structured documents when the user does not explicitly specify the structural
requirements. In standard IR retrievable units are fixed, so only the entire
document, or, sometimes, some pre-defined parts such as chapters or paragraphs
constitute retrievable units. The structure of documents, often quite complex and
consisting of a varying number of chapters, sections, tables, formulae,
bibliographic items, etc., is therefore "flattened" and not exploited. Classical retrieval
methods lack the possibility to interactively determine the size and the type 
of retrievable units that best suit an actual retrieval task or user preferences. 
Some IR researchers are aiming at developing retrieval models that dynamically 
return document components of varying complexity. A retrieval result may then 
consist of several entry points to the same document, corresponding to structural
elements, whereby each entry point is weighted according to how it satisfies the 
query. Models proposed so far exploit the content and the structure of docu- 
ments to estimate the relevance of document components to queries, based on 
the aggregation of the estimated relevance of their related components. These 
models have been based on various theories, like for example fuzzy logic [3], 
Dempster-Shafer’s theory of evidence [10], probabilistic logic [2], and Bayesian 
inference [11]. What these models have in common is that the basic compo- 
nents of their retrieval function are variants of the standard IR term weighting 
scheme, which combines term frequency with inverse document frequency, often
normalised taking into account document length. Evidence associated with the
document structure is often encoded into one or both of these term weighting
functions. A somewhat different approach has been presented in [14], where
evidence associated with the document structure is made explicit by introducing
an "accessibility" dimension. This dimension measures the strength of the
structural relationship between document components: the stronger the relationship,
the more impact the content of a component has in describing the content of
its related components. Our approach is based on a similar view of structured
document retrieval, where a quantitative model is used to compute the strength
of the relations between structural elements. In fact, we use a BN to model these
relations. A BN is a very powerful tool to capture these relations, with particular
regard to hierarchically structured documents. The next section contains a
detailed presentation of our approach. Other approaches to structured document
retrieval also based on BNs can be found in [8,11,12].



3 From Two-Layered to Multi-layered Bayesian Networks
for Structured Document Retrieval

To deal with structured document retrieval, we are going to assume that
each document is composed of a hierarchical structure of $l$ abstraction levels
$\mathcal{L}_1, \ldots, \mathcal{L}_l$, each one representing a structural association
of elements in the text, for instance chapters, sections, subsections and paragraphs
in the context of a general structured document collection, or scenes, shots, and
frames in MPEG-7 videos. The level in which the document itself is included will be
noted as level 1 ($\mathcal{L}_1$), and the most specific level as $\mathcal{L}_l$.

Each level contains structural units, i.e., single elements such as Chapter 4,
Subsection 4.5, Shot 54, and so on. Each one of these structural units will be noted
as $U_{i,j}$, where $i$ is the identifier of that unit in the level $j$. The number of
structural units contained in each level $\mathcal{L}_j$ is represented by
$|\mathcal{L}_j|$. Therefore,
$\mathcal{L}_j = \{U_{1,j}, \ldots, U_{|\mathcal{L}_j|,j}\}$. The units are organised
according to the actual structure of the document: every unit $U_{i,j}$ at level $j$,
except the unit at level $j = 1$ (i.e., the complete document $D_i = U_{i,1}$), is
contained in only one unit $U_{z(i,j),j-1}$ of the lower level $j - 1$$^{4}$,
$U_{i,j} \subset U_{z(i,j),j-1}$. Therefore, each structured document
may be represented as a tree (an example is displayed in Fig. 2).




Fig. 2. A structured document 

$^{4}$ $z(i,j)$ is a function that returns the index of the unit in level $j - 1$ to
which the unit with index $i$ in level $j$ belongs.
Now, we shall describe the Bayesian network used by our Bayesian Network
Retrieval model for Structured Documents (BNR-SD). 

3.1 Network Topology 

Taking into account the topology of the BNR-2 model for standard retrieval
(see Fig. 1), it seems to us that the natural extension to deal with structured
documents is to connect the term nodes with the structural units
$U_{1,l}, \ldots, U_{|\mathcal{L}_l|,l}$ of the upper level $\mathcal{L}_l$. Therefore,
only the units in level $\mathcal{L}_l$ will be indexed, having
associated several terms describing their content (see Fig. 3).

From a graphical point of view, our Bayesian network will contain two
different types of nodes, those associated to structural units, and those related to
terms, so that $V = \mathcal{T} \cup \mathcal{U}$, where
$\mathcal{U} = \bigcup_{j=1}^{l} \mathcal{L}_j$. As in the BNR-2 model, each node
represents a binary random variable: $U_{i,j}$ takes its values in the set
$\{u_{i,j}^-, u_{i,j}^+\}$, representing that the unit is not relevant and is relevant,
respectively; a term variable $T_i$ is treated exactly as in the BNR-2 model. The
independence relationships that we assume in this case are of the same nature as
those considered in the BNR-2 model: terms are marginally independent of each
other, and the structural units are conditionally independent given the terms that
they contain.




Fig. 3. From an indexed document to an indexed structured document (Levels 1 to 3)




These assumptions, together with the hierarchical structure of the documents,
completely determine the topology of the Bayesian network with $l + 1$
layers, where the arcs go from term nodes to structural units in level $l$, and from
units in level $j$ to units in level $j - 1$, $j = 2, \ldots, l$. So, the network is
characterized by the following parent sets for each type of node:

- $\forall T_k \in \mathcal{T}$, $Pa(T_k) = \emptyset$.

- $\forall U_{i,l} \in \mathcal{L}_l$, $Pa(U_{i,l}) = \{T_k \in \mathcal{T} \mid U_{i,l} \text{ is indexed by } T_k\}$.

- $\forall j = 1, \ldots, l - 1$, $\forall U_{i,j} \in \mathcal{L}_j$, $Pa(U_{i,j}) = \{U_{h,j+1} \in \mathcal{L}_{j+1} \mid U_{h,j+1} \subseteq U_{i,j}\}$.

An example of this multi-layer BN is depicted in Fig. 4, for $l = 3$.
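
The three parent-set rules above can be sketched as follows, assuming the document tree is given as a containment map and the most specific units come with their indexing terms. The function `build_parent_sets`, its argument names and the toy play are hypothetical, chosen only to mirror the rules.

```python
def build_parent_sets(contained_in, indexed_by):
    """Build Pa(.) for every node of the multi-layer network.

    contained_in: {unit: enclosing unit one level down, or None for a document}.
    indexed_by:   {most specific unit: set of terms indexing it}.
    """
    parents = {}
    for unit, terms in indexed_by.items():
        parents[unit] = set(terms)             # arcs: term -> unit in level l
        for term in terms:
            parents.setdefault(term, set())    # term nodes have no parents
    for unit, container in contained_in.items():
        parents.setdefault(unit, set())
        if container is not None:
            # arcs: unit in level j+1 -> the unit in level j that contains it
            parents.setdefault(container, set()).add(unit)
    return parents


# Hypothetical three-level document: play > acts > scenes.
contained_in = {"play": None, "act1": "play", "act2": "play",
                "s1": "act1", "s2": "act1", "s3": "act2"}
indexed_by = {"s1": {"king", "crown"}, "s2": {"crown"}, "s3": {"ghost"}}
parents = build_parent_sets(contained_in, indexed_by)
```

Note that arcs point from the contained unit to its container, so evidence on terms flows towards the whole document.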



3.2 Conditional Probabilities 

The next task is the assessment of the (conditional) probability distributions:

• Term nodes $T_k$: they store the same marginal probabilities $p(t_k)$ as in the
BNR-2 model.

• Structural units $U_{i,j}$: to compute $p(u_{i,l} \mid pa(U_{i,l}))$ and
$p(u_{i,j} \mid pa(U_{i,j}))$, $j \neq l$, we use the same kind of canonical model
considered for the relationships between terms and documents in the BNR-2 model
(see eq. (1)):

$$p(u_{i,l}^+ \mid pa(U_{i,l})) = \sum_{T_k \in R(pa(U_{i,l}))} w(T_k, U_{i,l}), \tag{4}$$

$$p(u_{i,j}^+ \mid pa(U_{i,j})) = \sum_{U_{h,j+1} \in R(pa(U_{i,j}))} w(U_{h,j+1}, U_{i,j}), \tag{5}$$

where in this case $w(T_k, U_{i,l})$ is a weight associated to each term indexing the
unit $U_{i,l}$, and $w(U_{h,j+1}, U_{i,j})$ is a weight measuring the importance of the
unit $U_{h,j+1}$ within $U_{i,j}$, with $w(T_k, U_{i,l}) \geq 0$,
$w(U_{h,j+1}, U_{i,j}) \geq 0$, $\sum_{T_k \in Pa(U_{i,l})} w(T_k, U_{i,l}) = 1$ and
$\sum_{U_{h,j+1} \in Pa(U_{i,j})} w(U_{h,j+1}, U_{i,j}) = 1$$^{5}$. In either case
$R(pa(U_{i,j}))$ is the subset of parents of $U_{i,j}$ (terms for $j = l$, units in
level $j + 1$ for $j \neq l$) that are instantiated as relevant in the configuration
$pa(U_{i,j})$.

To conclude the specification of the conditional probabilities in the network,
we have to give values to the weights $w(T_k, U_{i,l})$ and $w(U_{h,j+1}, U_{i,j})$.
Let us introduce some additional notation: for any unit $U_{i,j} \in \mathcal{U}$, let
$A(U_{i,j}) = \{T_k \in \mathcal{T} \mid T_k \text{ is an ancestor of } U_{i,j}\}$, i.e.,
$A(U_{i,j})$ is the set of terms that are included in the unit $U_{i,j}$$^{6}$. Let
$tf_{k,C}$ be the frequency of the term $T_k$ (number of times that $T_k$ occurs) in
the set of terms $C$ and $idf_k$ be the inverse document frequency of $T_k$ in the
whole collection. We shall use the weighting scheme
$\rho(T_k, C) = tf_{k,C} \cdot idf_k$. We define

$$\forall U_{i,l} \in \mathcal{L}_l,\ \forall T_k \in Pa(U_{i,l}), \quad w(T_k, U_{i,l}) = \frac{\rho(T_k, A(U_{i,l}))}{\sum_{T_h \in A(U_{i,l})} \rho(T_h, A(U_{i,l}))} \tag{6}$$



$^{5}$ Notice that we use here the symbol $=$ instead of $\leq$. The only reason for
this restriction is to ease some implementation details of the model.

$^{6}$ Notice that, although a unit $U_{i,j}$ in level $j \neq l$ is not connected
directly to any term, it contains all the terms indexing structural units in level $l$
that are included in $U_{i,j}$. Notice also that $A(U_{i,l}) = Pa(U_{i,l})$.
$$\forall j = 1, \ldots, l - 1,\ \forall U_{i,j} \in \mathcal{L}_j,\ \forall U_{h,j+1} \in Pa(U_{i,j}), \quad w(U_{h,j+1}, U_{i,j}) = \frac{\sum_{T_k \in A(U_{h,j+1})} \rho(T_k, A(U_{h,j+1}))}{\sum_{T_k \in A(U_{i,j})} \rho(T_k, A(U_{i,j}))} \tag{7}$$

Observe that the weights in eq. (6) verify that
$\sum_{T_k \in Pa(U_{i,l})} w(T_k, U_{i,l}) = 1$. In fact, they are only the classical
tfidf weights, normalized to sum up to one. It is also important to notice that, as
$tf_{k,B \cup C} = tf_{k,B} + tf_{k,C}$, then
$\rho(T_k, B \cup C) = \rho(T_k, B) + \rho(T_k, C)$. Moreover,
$A(U_{i,j}) = \bigcup_{U_{h,j+1} \in Pa(U_{i,j})} A(U_{h,j+1})$. Taking
into account these facts, it is also clear that the weights in eq. (7) verify that
$\sum_{U_{h,j+1} \in Pa(U_{i,j})} w(U_{h,j+1}, U_{i,j}) = 1$. The weights
$w(U_{h,j+1}, U_{i,j})$ measure, in some sense, the proportion of the content of the
unit $U_{i,j}$ which can be attributed to each one of its components.
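
A small numerical sketch may help to see why both families of weights sum to one. It assumes the tf·idf scheme described above and uses the additivity of the term weights over disjoint sub-units; the helper names, the toy term counts and the idf values are hypothetical.

```python
def rho_total(unit, children, term_counts, idf):
    """Sum of tf * idf over all terms contained in `unit`.

    For a unit in the most specific level the terms are read directly;
    otherwise the totals of the contained units are added up.
    """
    if unit in term_counts:
        return sum(tf * idf[t] for t, tf in term_counts[unit].items())
    return sum(rho_total(c, children, term_counts, idf) for c in children[unit])


def term_weights(unit, term_counts, idf):
    """Weights of the terms indexing a most specific unit, as in eq. (6)."""
    total = sum(tf * idf[t] for t, tf in term_counts[unit].items())
    return {t: tf * idf[t] / total for t, tf in term_counts[unit].items()}


def unit_weights(unit, children, term_counts, idf):
    """Weights of the units contained in `unit`, as in eq. (7)."""
    total = rho_total(unit, children, term_counts, idf)
    return {c: rho_total(c, children, term_counts, idf) / total
            for c in children[unit]}


# Hypothetical play with two acts and three scenes; idf values are made up.
children = {"play": ["act1", "act2"], "act1": ["s1", "s2"], "act2": ["s3"]}
term_counts = {"s1": {"king": 2, "crown": 1}, "s2": {"crown": 1}, "s3": {"ghost": 3}}
idf = {"king": 1.0, "crown": 0.5, "ghost": 2.0}

w_s1 = term_weights("s1", term_counts, idf)
w_act1 = unit_weights("act1", children, term_counts, idf)
w_play = unit_weights("play", children, term_counts, idf)
```

Here scene s1 accounts for 2.5 of the 3.0 weight units of act1, so it receives five sixths of that act's content.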



3.3 Inference 

The inference process that we have to carry out in order to use the BNR-SD
model is, given a query $Q$, to compute the posterior probabilities of relevance of
all the structural units, $p(u_{i,j}^+ \mid Q)$. Although this computation may be
difficult in a general case, in our case all the conditional probabilities have been
assessed using the canonical model in eq. (1) and only term nodes are instantiated
(so that only a top-down inference is required). In this context, having in mind the
result in Proposition 1, the inference process can be carried out very efficiently,
in the following way:

- For the structural units in level $\mathcal{L}_l$, the posterior probabilities are
(as in the BNR-2 model):

$$p(u_{i,l}^+ \mid Q) = \sum_{T_k \in Pa(U_{i,l}) \cap Q} w(T_k, U_{i,l}) + \frac{1}{M} \sum_{T_k \in Pa(U_{i,l}) \setminus Q} w(T_k, U_{i,l}). \tag{8}$$

- For the structural units in level $\mathcal{L}_j$, $j \neq l$:

$$p(u_{i,j}^+ \mid Q) = \sum_{U_{h,j+1} \in Pa(U_{i,j})} w(U_{h,j+1}, U_{i,j}) \cdot p(u_{h,j+1}^+ \mid Q). \tag{9}$$

Therefore, we can compute the required probabilities on a level-by-level basis,
starting from level $l$ and going down to level 1.

It can be easily proven that, using the proposed tfidf weighting scheme, the
BNR-SD model is equivalent to multiple BNR-2 models, one per level $\mathcal{L}_j$.
In other words, if we consider each unit $U_{i,j}$ in $\mathcal{L}_j$ as a single
document (indexed by all the terms in $A(U_{i,j})$) and build a BNR-2 model, then the
posterior probabilities of relevance are the same as in the BNR-SD model. The
advantage of
BNR-SD is that it can manage all the levels simultaneously, and is much more 
efficient in terms of storage requirements and running time than multiple BNR-2 
models. 
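
The level-by-level computation of eqs. (8) and (9) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: the function name, the data layout (one weight dictionary per unit, levels listed from $l - 1$ down to 1) and the toy weights are hypothetical.

```python
def infer_relevance(term_weights, unit_weights_by_level, query, num_terms):
    """Posterior relevance of every structural unit given the query terms.

    term_weights:          {unit in level l: {term: weight}}
    unit_weights_by_level: list of {unit: {contained unit: weight}},
                           ordered from level l-1 down to level 1.
    """
    prior = 1.0 / num_terms
    posterior = {}
    # Eq. (8): units indexed directly by terms.
    for unit, weights in term_weights.items():
        in_query = sum(w for t, w in weights.items() if t in query)
        outside = sum(w for t, w in weights.items() if t not in query)
        posterior[unit] = in_query + prior * outside
    # Eq. (9): each remaining level uses the posteriors of the level above it.
    for level in unit_weights_by_level:
        for unit, weights in level.items():
            posterior[unit] = sum(w * posterior[child] for child, w in weights.items())
    return posterior


# Hypothetical weights for a play with two acts and three scenes.
term_weights = {"s1": {"king": 0.8, "crown": 0.2},
                "s2": {"crown": 1.0},
                "s3": {"ghost": 1.0}}
levels = [{"act1": {"s1": 2.5 / 3.0, "s2": 0.5 / 3.0}, "act2": {"s3": 1.0}},
          {"play": {"act1": 1.0 / 3.0, "act2": 2.0 / 3.0}}]
posterior = infer_relevance(term_weights, levels, query={"king"}, num_terms=3)
```

With these toy weights, treating act1 as a flat document (term weights 2/3 for "king" and 1/3 for "crown") gives the same posterior 7/9, in line with the equivalence stated above.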







3.4 Model Implementation 

The BNR-SD model has been implemented using the Lemur Toolkit, a software
package written in C++ designed to develop new applications on Information
Retrieval and Language Modelling. This package (available at
http://www-2.cs.cmu.edu/~lemur/) offers a wide range of classes that cover almost
all the tasks required in IR.

Our implementation uses an inverted file, i.e., a data structure containing,
for each term in the collection, the structural units in level $l$ where it occurs (the
term's children in the network). The evaluation of units in level $l$ is carried out by
accumulating, for each unit $U_{i,l}$, the weights $w(T_k, U_{i,l})$ of those terms
belonging to the query by which they have been indexed. To speed up the retrieval, all
the weights $w(T_k, U_{i,l})$ (eq. 6) have been precomputed at indexing time and
stored in a binary random access file. When the accumulation process is finished,
for each unit $U_{i,l}$ sharing terms with the query (i.e.,
$Pa(U_{i,l}) \cap Q \neq \emptyset$) we have an accumulator
$S_{i,l} = \sum_{T_k \in Pa(U_{i,l}) \cap Q} w(T_k, U_{i,l})$; then we can compute
the value $p(u_{i,l}^+ \mid Q)$ in eq. (8) as
$p(u_{i,l}^+ \mid Q) = S_{i,l} + \frac{1 - S_{i,l}}{M}$. Notice that the
units containing no query term do not need to be evaluated, and their posterior
probability is the same as their prior, $p(u_{i,l}^+ \mid Q) = \frac{1}{M}$.

With respect to the structural units from the rest of layers, the only
information needed is also stored in a binary random access file, containing, for each
unit $U_{h,j+1}$, the one where it is contained (its unique child in the network),
$U_{i,j}$, and the corresponding weight $w(U_{h,j+1}, U_{i,j})$ (eq. 7), which are
also precomputed at indexing time. In order to evaluate the units in level
$j \neq l$, those units in level $j + 1$ evaluated in a previous stage will play the
same role as query terms do in the evaluation of units in level $l$: for each unit
$U_{i,j}$ containing units $U_{h,j+1}$ previously evaluated (this happens if
$A(U_{h,j+1}) \cap Q \neq \emptyset$) we use two accumulators, one for the weights
$w(U_{h,j+1}, U_{i,j})$ and the other for the products
$w(U_{h,j+1}, U_{i,j}) \cdot p(u_{h,j+1}^+ \mid Q)$. At the end of the accumulation
process, for each one of these units $U_{i,j}$ we have two accumulators,
$S_{i,j} = \sum_{U_{h,j+1} \in Q_{i,j}} w(U_{h,j+1}, U_{i,j})$
and $SP_{i,j} = \sum_{U_{h,j+1} \in Q_{i,j}} w(U_{h,j+1}, U_{i,j}) \cdot p(u_{h,j+1}^+ \mid Q)$,
where $Q_{i,j} = \{U_{h,j+1} \in Pa(U_{i,j}) \mid A(U_{h,j+1}) \cap Q \neq \emptyset\}$.
Then we can compute the value $p(u_{i,j}^+ \mid Q)$ in eq. (9) as
$p(u_{i,j}^+ \mid Q) = SP_{i,j} + \frac{1}{M}(1 - S_{i,j})$.
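
The accumulator scheme described in this section can be sketched as follows. The sketch keeps everything in memory instead of using binary random access files, and the names `retrieve`, `inverted` and `child_of`, as well as the toy data, are hypothetical; it only evaluates units that are reached from some query term.

```python
from collections import defaultdict


def retrieve(inverted, child_of, query, num_terms, num_levels):
    """Score only the units reachable from the query terms.

    inverted: {term: [(unit in level l, weight), ...]}   (inverted file)
    child_of: {unit: (unit that contains it, weight)}    (one entry per unit)
    Units missing from the result keep their prior probability 1/M.
    """
    prior = 1.0 / num_terms
    acc = defaultdict(float)
    for term in query:
        for unit, weight in inverted.get(term, []):
            acc[unit] += weight                       # accumulator S for level l
    current = {u: s + (1.0 - s) * prior for u, s in acc.items()}
    scores = dict(current)
    for _ in range(num_levels - 1):
        s_acc, sp_acc = defaultdict(float), defaultdict(float)
        for unit, p in current.items():
            if unit in child_of:
                container, weight = child_of[unit]
                s_acc[container] += weight            # accumulator S
                sp_acc[container] += weight * p       # accumulator SP
        current = {u: sp_acc[u] + prior * (1.0 - s_acc[u]) for u in s_acc}
        scores.update(current)
    return scores


# Hypothetical precomputed weights for a play with two acts and three scenes.
inverted = {"king": [("s1", 0.8)],
            "crown": [("s1", 0.2), ("s2", 1.0)],
            "ghost": [("s3", 1.0)]}
child_of = {"s1": ("act1", 2.5 / 3.0), "s2": ("act1", 0.5 / 3.0),
            "s3": ("act2", 1.0),
            "act1": ("play", 1.0 / 3.0), "act2": ("play", 2.0 / 3.0)}
scores = retrieve(inverted, child_of, {"king"}, num_terms=3, num_levels=3)
```

Scene s2 and act2 never appear in `scores` because no query term reaches them; their share enters the container's score through the term (1 − S)/M.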

4 Preliminary Experiments 

Our BNR-SD model has been tested using a collection of structured documents, 
marked up in XML, containing 37 plays by William Shakespeare [9]. A play has
been considered structured in acts, scenes and speeches (so that $l = 4$), and may
also contain epilogues and prologues. Speeches have been the only structural
units indexed using Lemur. The total number of unique terms contained in these 
units is 14019, and the total number of structural units taken into account is 
32022. With respect to the queries, the collection is distributed with 43 queries, 
with their corresponding relevance judgements. From these 43 queries, the 35 
which are content-only queries were selected for our experiments. 
Table 1. Average precision values for the experiments with the BNR-SD model

Recall    AP-PLAY   AP-ACT    AP-SCENE  AP-SPEECH  AP-All
0         0.9207    0.7797    0.5092    0.1957     0.2019
0.1       0.9207    0.7797    0.5065    0.1368     0.1346
0.2       0.9207    0.7797    0.4600    0.1100     0.1008
0.3       0.9207    0.7738    0.4279    0.0846     0.0719
0.4       0.9207    0.7518    0.4088    0.0721     0.0632
0.5       0.9207    0.7318    0.3982    0.0434     0.0437
0.6       0.9207    0.6755    0.3663    0.0362     0.0380
0.7       0.9207    0.6512    0.3220    0.0201     0.0288
0.8       0.9207    0.6453    0.3054    0.0138     0.0189
0.9       0.9207    0.6253    0.2580    0.0079     0.0107
1         0.9207    0.6253    0.2484    0.0025     0.0059
AVP-11p   0.9207    0.7108    0.3828    0.0657     0.0653

As a way of showing the new potential of retrieving structured documents, 
several experiments have been designed. Let us suppose that a user is interested 
in the structural units of a specific type that are relevant for each query (i.e., 
s/he selects a given granularity level). Therefore, four retrievals have been run 
for the set of queries: only retrieving plays, only acts, only scenes, prologues and 
epilogues, and finally, speeches. A last experiment tries to return to the user, in 
only one ranking, all the structural units ranked according to their relevance. 
Table 1 shows the average recall-precision values (using the 11 standard recall 
values) for the five experiments. The row AVP-11p shows the average precision
for the 11 values of recall. The maximum number of units retrieved for each 
experiment has been fixed to 1000. 
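
For readers unfamiliar with the measure, the sketch below computes precision at the 11 standard recall values for a single query and then averages them. It assumes the usual interpolation rule (the precision at a recall level is the maximum precision observed at that recall or higher), which the text does not spell out; the function name and the toy ranking are hypothetical.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    hits = 0
    points = []  # (recall, precision) measured at each relevant item retrieved
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Assumed rule: best precision among all points with recall >= level.
    return [max((p for r, p in points if r >= level - 1e-12), default=0.0)
            for level in levels]


# Hypothetical ranking of five units, three of which are relevant.
curve = eleven_point_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"})
average = sum(curve) / len(curve)
```

Averaging such curves over all queries gives one column of a recall-precision table, and the mean of the 11 values gives a single summary figure per experiment.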

An important fact to notice is that when the system offers a ranking with all 
the structural units, the performance is not very good. This behaviour is due to 
the fact that, according to the expressions used to compute the relevance of the 
units, the posterior probability of a play, for instance, is very small compared 
to that assigned to a speech. This implies that the lower level units, like plays 
or acts, for example, are located in the furthest positions in the ranking and 
therefore, never retrieved. After observing the ranking produced in the last two 
experiments, we noticed that there are a number of units, in this case speeches, 
that have a posterior probability equal to 1.0. The reason is that they are very 
short, perhaps one or two terms, occurring all of them in the query. As the 
weights are normalised to 1.0, the final relevance is very high and these units are 
placed at the top of the ranking, introducing some noise. This is another cause
of the poor behaviour of the retrieval considering only speeches and all types of
units as well. These facts suggest the convenience of including in our model a
decision procedure to select the appropriate units to be retrieved. 

On the other hand, the effectiveness of the system is quite good for the first
three experiments, where the objective is to retrieve larger units containing more
terms, such as acts and scenes. However, it should be noticed that the effectiveness
decreases as the number of units involved in the retrieval increases and the
number of terms per unit decreases.

5 Concluding Remarks 

In this paper a Bayesian network-based model for structured document retrieval, 
BNR-SD, has been presented, together with some promising preliminary experi- 
ments with the structured test collection of Shakespeare’s plays. Our model can 
be extended/improved in several ways, and we plan to pursue some of them in 
the near future: 

- To incorporate into our network model a decision module, in order to select
the appropriate structural units (the best entry points) that will be shown
to the users, depending on their own preferences.

- To allow structural units in levels other than $l$ to have specific textual
information associated with them (for example the title of a chapter or a section);
to allow also direct relationships between units in non-consecutive levels of the
hierarchy (e.g. paragraphs and chapters). These extensions do not bring any
technical complications; they are only a matter of implementation.

- To include in our network model specific term relationships (as those in [7])
and/or document relationships (as those in [1]). Alternatively, we could also
use ontologies to model concepts and their relationships in a given domain
of knowledge.

- To permit our model to deal not only with content-only queries, but also
with structure-only and content-and-structure queries; to let the queries
include, in addition to terms, also structural units.

- To apply our model, in combination with techniques for image analysis, to
multimedia retrieval, particularly MPEG-7 videos.
Abstract. Qualitative probabilistic networks are designed for probabilistic
inference in a qualitative way. They capture qualitative influ-
ences between variables, but do not provide for indicating the strengths 
of these influences. As a result, trade-offs between conflicting influences 
remain unresolved upon inference. In this paper, we investigate the use 
of order-of-magnitude kappa values to capture strengths of influences in 
a qualitative network. We detail the use of these kappas upon inference, 
thereby providing for trade-off resolution. 



1 Introduction 

Qualitative probabilistic networks [1] and the kappa calculus [2] both provide for 
probabilistic reasoning in a qualitative way. A qualitative probabilistic network 
is basically a qualitative abstraction of a probabilistic network and similarly 
encodes variables and the probabilistic relationships between them in a directed 
acyclic graph. The encoded relationships represent influences on the probability 
distributions of variables and are summarised by a sign indicating the direction 
of change or shift (positive, negative, zero, or unknown) in the distribution 
of one variable occasioned by another. The kappa calculus offers a framework 
for reasoning with defeasible beliefs, where belief states are given by a ranking 
function that maps propositions into non-negative integers called kappa values. 
Kappa values, by means of a probabilistic interpretation [3], were previously 
used to abstract probabilistic networks into so-called Kappa networks, where a
network’s probabilities are abstracted into kappa values, which are easier to 
assess than precise probabilities and lead to more robust inference results [4,5]. 

Inference in Kappa networks is based on the use of kappa calculus and is in 
general of the same order of complexity as inference in probabilistic networks 
(NP-hard). In contrast, inference with a qualitative probabilistic network can 
be done efficiently by propagating and combining signs [6]. However, qualitative 
probabilistic networks, due to the high level of abstraction, do not provide for 
weighing influences with conflicting signs and, hence, do not provide for resolving 
such trade-offs. Inference with a qualitative probabilistic network therefore often
results in ambiguous signs that will spread throughout most of the network. 

Preventing ambiguous inference results is essential as qualitative networks 
can play an important role in the construction of quantitative probabilistic net- 
works for realistic applications [7] . Assessing the numerous required point prob- 
abilities for a probabilistic network is a hard task and typically performed only 
when the network’s digraph is considered robust. By first assessing signs for the 
modelled relationships, a qualitative network is obtained that allows for studying 
the inference behaviour of the projected quantitative network, prior to probabil- 
ity assessment. Ambiguous inference results in a qualitative network can to some 
extent be averted by, for example, introducing a notion of strength of influences. 
To this end, previous work partitions the set of qualitative influences into strong 
and weak influences [8]. In this paper, we investigate the combination of quali- 
tative probabilistic networks and kappa values. A novel approach to using kappa 
values allows us to distinguish several levels of strength of qualitative influences, 
thereby enabling the resolution of more trade-offs. 

This paper is organised as follows. Section 2 provides preliminaries concerning 
qualitative probabilistic networks; Sect. 3 details our use of kappa values to 
indicate strengths of influences. Section 4 presents an inference procedure for 
our kappa enhanced networks. The paper ends with some conclusions in Sect. 5. 

2 Qualitative Probabilistic Networks 

A probabilistic network, or Bayesian network, uniquely encodes a joint proba- 
bility distribution Pr over a set of statistical variables. A qualitative probabilistic 
network (QPN) can be viewed as a qualitative abstraction of such a network, 
similarly encoding statistical variables and probabilistic relationships between 
them in an acyclic directed graph $G = (V(G), A(G))$ [1]. Each node $A \in V(G)$
represents a variable, which, for ease of exposition, we assume to be binary,
writing $a$ for $A = true$ and $\bar{a}$ for $A = false$. The set $A(G)$ of arcs
captures probabilistic independence between the variables. Where a quantitative
probabilistic network associates conditional probability distributions with its
digraph, a qualitative probabilistic network specifies qualitative influences and
synergies that capture shifts in the existing, but as yet unknown, (conditional)
probability distributions. A qualitative influence between two nodes expresses how
the values of one node influence the probabilities of the values of the other node.
For example, a positive qualitative influence along arc $A \to B$ of node $A$ on
node $B$, denoted $S^+(A, B)$, expresses that observing a high value for $A$ makes
the higher value for $B$ more likely, regardless of any other direct influences on
$B$, that is, for $a > \bar{a}$ and any combination of values $x$ for the set
$\pi(B) \setminus \{A\}$ of (direct) predecessors of $B$ other than $A$:

$$\Pr(b \mid ax) - \Pr(b \mid \bar{a}x) \geq 0.$$

A negative qualitative influence $S^-$ and a zero qualitative influence $S^0$ are
defined analogously; if an influence is not monotonic or if it is unknown, it is
called ambiguous, denoted $S^?$. The definition of qualitative influence can be
straightforwardly generalised to an influence along a chain of nodes in $G$.

A qualitative probabilistic network also includes product synergies [6], that 
capture the sign of the (intercausal) qualitative influence induced between the 
predecessors A and B of a node C upon its observation; an induced intercausal 
influence behaves as a regular qualitative influence. 

The set of qualitative influences exhibits various properties. The property
of symmetry states that, if the network includes the influence
$S^{\delta}(A, B)$, then it also includes $S^{\delta}(B, A)$,
$\delta \in \{+, -, 0, ?\}$. The transitivity property asserts that the signs of
qualitative influences along a chain with no head-to-head nodes combine into a
sign for a net influence with the $\otimes$-operator from Table 1. The property
of composition asserts that the signs of multiple influences between nodes along
parallel chains combine into a sign for a net influence with the
$\oplus$-operator. Note that composition of two influences with conflicting
signs, modelling a trade-off, results in an ambiguous sign, indicating that the
trade-off cannot be resolved.

For inference with a qualitative network an efficient algorithm, which builds
on the properties of symmetry, transitivity, and composition of influences, is
available [6] and summarised in Fig. 1. The algorithm traces the effect of
observing a value for one node on the other nodes by message-passing between
neighbours. For each node, a node sign is determined, indicating the direction
of change in its probability distribution occasioned by the new observation.
Initial node signs equal ‘0’, and observations are entered as a ‘$+$’ for the
observed value true or a ‘$-$’ for the value false. Each node receiving a
message updates its sign with the $\oplus$-operator and subsequently sends a
message to each (induced) neighbour that is not independent of the observed
node. The sign of this message is the $\otimes$-product of the node's (new) sign
and the sign of the influence it traverses.



procedure PropagateSign(from, to, messagesign):

  sign[to] ← sign[to] ⊕ messagesign;
  for each (induced) neighbour V_i of to
  do linksign ← sign of (induced) influence between to and V_i;
     messagesign ← sign[to] ⊗ linksign;
     if V_i ≠ from and V_i ∉ Observed and sign[V_i] ≠ sign[V_i] ⊕ messagesign
     then PropagateSign(to, V_i, messagesign)

Fig. 1. The sign-propagation algorithm

Fig. 2. The qualitative Antibiotics network



This process is repeated throughout the network, until each node has changed
its sign at most twice (once from ‘0’ to ‘$+$’, ‘$-$’ or ‘?’, then only to ‘?’).

Example 1. Consider the qualitative network from Fig. 2, representing a frag- 
ment of fictitious medical knowledge which pertains to the effects of taking 
antibiotics on a patient. Node A represents whether or not the patient takes 
antibiotics. Node T models whether or not the patient has typhoid fever and 
node D represents presence or absence of diarrhoea in the patient. Node F 
describes whether or not the patient’s bacterial flora composition has changed. 
Typhoid fever and bacterial flora change can both cause diarrhoea:
$S^{+}(T, D)$ and $S^{+}(F, D)$. Antibiotics can cure typhoid fever,
$S^{-}(A, T)$, but may also change the bacterial flora composition,
$S^{+}(A, F)$.

We observe that a patient has taken antibiotics and enter the sign ‘$+$’ for
node A. Node A propagates this sign to T, which receives ‘$+ \otimes - = -$’ and
sends this to node D. Node D in turn receives ‘$- \otimes + = -$’ and does not
pass on any sign. Node A also sends its sign to F, which receives
‘$+ \otimes + = +$’ and passes this on to node D. Node D then receives the
additional sign ‘$+ \otimes + = +$’. The two signs for D are combined, resulting
in the ambiguous ‘$- \oplus + = {?}$’; the modelled trade-off thus remains
unresolved. □
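
The propagation of Fig. 1 can be sketched in Python as follows. This is a
minimal sketch under stated assumptions, not the implementation of [6]: the
sign tables are the standard QPN tables for the sign product and sign sum
(assumed to agree with Table 1), only a single observation is handled, and
induced intercausal influences are left out, so a message that enters a node
from one of its parents is not forwarded to its other parents. The names
sign_product, sign_sum and propagate are illustrative.

```python
def sign_product(a, b):
    """Sign product for chaining influences (standard QPN table, assumed)."""
    if a == '0' or b == '0':
        return '0'
    if a == '?' or b == '?':
        return '?'
    return '+' if a == b else '-'


def sign_sum(a, b):
    """Sign sum for parallel influences (standard QPN table, assumed)."""
    if a == '0':
        return b
    if b == '0':
        return a
    return a if a == b else '?'


def propagate(arcs, observed, observed_sign):
    """Propagate one observation; arcs maps (parent, child) to a sign."""
    nodes = {node for arc in arcs for node in arc}
    sign = {node: '0' for node in nodes}

    def neighbours(node, came_from):
        # A message that entered head-to-head (from a parent) only travels on
        # to children; intercausal influences are not modelled in this sketch.
        entered_from_parent = (came_from, node) in arcs
        for (parent, child), link_sign in arcs.items():
            if parent == node:
                yield child, link_sign
            elif child == node and not entered_from_parent:
                yield parent, link_sign

    def propagate_sign(came_from, to, message):
        sign[to] = sign_sum(sign[to], message)
        for neighbour, link_sign in neighbours(to, came_from):
            out = sign_product(sign[to], link_sign)
            if (neighbour != came_from and neighbour != observed
                    and sign[neighbour] != sign_sum(sign[neighbour], out)):
                propagate_sign(to, neighbour, out)

    propagate_sign(None, observed, observed_sign)
    return sign


# The qualitative Antibiotics network of Example 1.
antibiotics = {('A', 'T'): '-', ('A', 'F'): '+',
               ('T', 'D'): '+', ('F', 'D'): '+'}
print(propagate(antibiotics, 'A', '+'))
```

Run on the Antibiotics network, the sketch yields the node signs of Example 1,
including the ambiguous sign for D.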



3 Introducing a Notion of Strength into QPNs 

To provide for trade-off resolution in qualitative probabilistic networks, we in- 
troduce a notion of strength of qualitative influences using kappa values. 



3.1 Kappa Rankings and Their Interpretations 

The kappa calculus provides for a semi-qualitative approach to reasoning with
uncertainty [2] [3]. In the kappa calculus, degrees of (un)certainty are
expressed by a ranking $\kappa$ that maps propositions into non-negative
integers such that $\kappa(\mathit{true}) = 0$ and
$\kappa(a \vee b) = \min\{\kappa(a), \kappa(b)\}$. For reasoning within the
kappa calculus simple combination rules for manipulation of $\kappa$-values
exist.

Kappa rankings can be interpreted as order-of-magnitude approximations of
probabilities [3], which allows, for example, posterior probabilities to be
computed using the kappa calculus. A probability $\Pr(x)$ can be approximated
by a polynomial written in terms of an (infinitesimal) base number
$0 < \varepsilon < 1$. $\kappa(x)$ now represents the order of magnitude of
this polynomial. More formally,

$\kappa(x) = n \iff \varepsilon^{n+1} < \Pr(x) \le \varepsilon^{n}.$ (1)

Note that higher probabilities are associated with lower $\kappa$-values; for
example, $\kappa(x) = 0$ if $\Pr(x) = 1$, and $\kappa(x) = \infty$ iff
$\Pr(x) = 0$.
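
As an illustration of equivalence (1), the following sketch computes the kappa
value of a probability for a concrete, non-infinitesimal base. The function
name kappa and the use of exact fractions (to avoid rounding at the interval
boundaries) are choices made for this example only.

```python
from fractions import Fraction
import math


def kappa(prob, eps):
    """Return n with eps**(n+1) < prob <= eps**n, or infinity if prob is 0."""
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    if not 0 <= prob <= 1:
        raise ValueError("prob must be a probability")
    if prob == 0:
        return math.inf
    n = 0
    # Move to the next order of magnitude while prob does not exceed it.
    while prob <= eps ** (n + 1):
        n += 1
    return n


eps = Fraction(1, 10)
for p in (Fraction(1), Fraction(1, 2), Fraction(3, 100), Fraction(0)):
    print(p, kappa(p, eps))
```

With base 1/10, the probabilities 1 and 1/2 both receive kappa value 0, while
3/100 receives kappa value 1; higher probabilities thus map to lower values.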

3.2 Using Kappas as Indicators of Strength 

We consider a qualitative probabilistic network with nodes A, B and X such
that $\pi(B) = X \cup \{A\}$. Let $I_x(A, B)$ denote the influence of A on B for
a certain combination of values $x$ for the set X. We recall that the sign
$\delta$ of the qualitative influence of A on B is defined as the sign of
$\Pr(b \mid a x) - \Pr(b \mid \bar{a} x)$, for all $x$; the absolute values of
these differences lie in the interval [0, 1]. Analogous to equivalence (1), we
define the $\kappa$-value of an influence of A on B for a certain $x$:

$\kappa(I_x(A, B)) = n \iff \varepsilon^{n+1} < |\Pr(b \mid a x) - \Pr(b \mid \bar{a} x)| \le \varepsilon^{n}.$

We then define the strength factor associated with the influence of A on B to
be an interval $[p, q]$ such that

$p \ge \max_x \kappa(I_x(A, B))$ and $0 \le q \le \min_x \kappa(I_x(A, B)),$

and each $\kappa$ expresses an order of magnitude in terms of the same base. We
associate strength factors with positive and negative influences; zero and
ambiguous influences are treated as in regular qualitative probabilistic
networks. The above definitions extend to chains of influences as well. The
resulting network will be termed a kappa-enhanced qualitative probabilistic
network and we write $S^{\delta[p,q]}(A, B)$ to denote a qualitative influence
of node A on node B with sign $\delta$ and strength factor $[p, q]$ in such a
network.

Note that for a strength factor $[p, q]$ we always have that $p \ge q$, where p
is greater than or equal to the kappa value of the weakest possible influence
and q is less than or equal to the kappa value of the strongest possible
influence. The reason for allowing influences to pretend to be stronger or
weaker than they are will become apparent. Note that for each influence
$[\infty, 0]$ is a valid strength factor, but not a very informative one.

We can now express strength of influences in a kappa-enhanced network in
terms of the base $\varepsilon$ chosen for the network: the influence of node A
on node B has strength factor $[p, q]$ iff for all $x$

$\varepsilon^{p+1} < |\Pr(b \mid a x) - \Pr(b \mid \bar{a} x)| \le \varepsilon^{q}.$
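
The definitions above can be made concrete with a small sketch that derives
the tightest strength factor of an influence from conditional probabilities.
The conditional probability values, the base 1/10 and the function names are
hypothetical and serve only to illustrate the definition; in the paper the
strength factors are specified directly rather than computed.

```python
from fractions import Fraction
import math


def kappa(value, eps):
    """Order of magnitude n with eps**(n+1) < value <= eps**n."""
    if value == 0:
        return math.inf
    n = 0
    while value <= eps ** (n + 1):
        n += 1
    return n


def tightest_strength_factor(cpt, eps):
    """Return (sign, p, q) for the influence of A on B.

    cpt maps each context x to the pair (Pr(b | a x), Pr(b | not-a x)).
    """
    diffs = [with_a - without_a for with_a, without_a in cpt.values()]
    if all(d >= 0 for d in diffs) and any(d > 0 for d in diffs):
        sign = '+'
    elif all(d <= 0 for d in diffs) and any(d < 0 for d in diffs):
        sign = '-'
    else:
        raise ValueError("zero or ambiguous influence: no strength factor")
    kappas = [kappa(abs(d), eps) for d in diffs]
    # p bounds the weakest effect (largest kappa), q the strongest (smallest).
    return sign, max(kappas), min(kappas)


# Hypothetical probabilities for two contexts x1 and x2.
cpt = {'x1': (Fraction(9, 10), Fraction(4, 10)),
       'x2': (Fraction(7, 20), Fraction(3, 10))}
print(tightest_strength_factor(cpt, Fraction(1, 10)))
```

Any interval that contains the returned one, such as $[\infty, 0]$, is also a
valid strength factor for the same influence.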

Instead of capturing the influences between variables by using kappa values for
probabilities, as is done in Kappa networks, we capture influences by associating
kappa values with the arcs. A Kappa network requires a number of kappa values
that is exponential in the number of parents for each node; our kappa-enhanced
networks require only a number of kappa values that is linear in the number of
parents for each node.

4 Inference in Kappa-Enhanced Networks

Probabilistic inference in qualitative probabilistic networks builds on the
properties of symmetry, transitivity and composition of influences. In order to
exploit the strength of influences upon inference in a kappa-enhanced network,
we define new $\otimes$- and $\oplus$-operators.



4.1 Kappa-Enhanced Transitive Combination 

To address the effect of multiplying two signs with strength factors in a
kappa-enhanced network, we consider the left network fragment from Fig. 4. The
fragment includes the chain of nodes A, B, C, with two qualitative influences
between them; in addition, we take $X = \pi(B) \setminus \{A\}$, and
$Y = \pi(C) \setminus \{B\}$. For the net influence of A on C, we now find by
conditioning on B that

$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) = (\Pr(c \mid b y) - \Pr(c \mid \bar{b} y)) \cdot (\Pr(b \mid a x) - \Pr(b \mid \bar{a} x))$ (2)

for any combination of values $x$ for the set X and $y$ for Y. Similar equations
are found given other arc directions, as long as node B has at least one
outgoing arc. Influences of A on C other than those shown are taken into
account by the $\oplus$-operator.

Fig. 3. The new $\otimes$-operator for combining signs and strength factors



Transitively combining influences amounts to multiplying differences in
probability, resulting in differences that are smaller than those multiplied;
transitive combination therefore causes weakening of influences. This is also
apparent from Fig. 3, which shows the table for the new $\otimes$-operator: upon
transitive combination, the strength factor shifts to higher kappa values,
corresponding to weaker influences. From the table it is also readily seen that
signs combine as in a regular qualitative probabilistic network; the difference
is just in the handling of the strength factors. We illustrate the combination
of two positive influences; similar observations apply to other combinations.



Proposition 1. Let A, B and C be as in the left fragment of Fig. 4, then

$S^{+[p,q]}(A, B) \otimes S^{+[r,s]}(B, C) \Rightarrow S^{+[p+r+1,\, q+s]}(A, C).$

Proof: Let X and Y be as in the left fragment of Fig. 4. Suppose
$S^{+[p,q]}(A, B)$ and $S^{+[r,s]}(B, C)$. We now have for the
network-associated base $\varepsilon$ that

$\varepsilon^{p+1} < \Pr(b \mid a x) - \Pr(b \mid \bar{a} x) \le \varepsilon^{q}$ and
$\varepsilon^{r+1} < \Pr(c \mid b y) - \Pr(c \mid \bar{b} y) \le \varepsilon^{s}.$

From Equation (2) for the net influence of node A on node C, we now find that

$\varepsilon^{(p+r+1)+1} = \varepsilon^{p+1} \cdot \varepsilon^{r+1} < \Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) \le \varepsilon^{q} \cdot \varepsilon^{s} = \varepsilon^{q+s}$

for any combination of values $xy$ for the set $X \cup Y$. As $\varepsilon > 0$,
we find that the resulting net influence of A on C is positive with strength
$[p + r + 1, q + s]$. □
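
The rule of Proposition 1 can be written as a short function. This is a sketch
that only covers positive and negative influences with strength factors; the
treatment of zero and ambiguous signs in Fig. 3 is not reproduced here. The
tuple representation (sign, p, q) and the name otimes are assumptions of this
example.

```python
def otimes(first, second):
    """Transitive combination of two signed strength factors.

    Each argument is a tuple (sign, p, q) with sign '+' or '-' and p >= q.
    Signs multiply as usual; the strength factor becomes [p + r + 1, q + s].
    """
    sign_a, p, q = first
    sign_b, r, s = second
    if sign_a not in '+-' or sign_b not in '+-':
        raise ValueError("only '+' and '-' are handled in this sketch")
    sign = '+' if sign_a == sign_b else '-'
    return (sign, p + r + 1, q + s)


# First propagation step of Example 2 (Sect. 4.3): the observation on A,
# entered with the dummy interval [-1, 0], meets the influence of A on T.
print(otimes(('+', -1, 0), ('-', 1, 0)))
```

The dummy interval $[-1, 0]$ leaves the strength factor of the influence it is
combined with unchanged, since $-1 + r + 1 = r$ and $0 + s = s$.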

4.2 Kappa-Enhanced Parallel Composition 

For combining multiple qualitative influences between nodes along parallel
chains, we provide the new $\oplus$-operator in Fig. 5, which takes strength
factors into account. In addressing parallel composition we first assume that
$\varepsilon$ is infinitesimal; the effect of a non-infinitesimal $\varepsilon$
on the $\oplus$-operator is discussed at the end of this section. We consider
the right network fragment from Fig. 4, which includes the parallel chains A, C,
and A, B, C, respectively, between the nodes A and C, and various qualitative
influences; in addition, we take $X = \pi(B) \setminus \{A\}$ and
$Y = \pi(C) \setminus \{A, B\}$. For the net influence of A on C along the two
parallel chains, conditioning on B gives the following equation for any
combination of values $x$ for X and $y$ for Y:

$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) = (\Pr(c \mid a b y) - \Pr(c \mid a \bar{b} y)) \cdot \Pr(b \mid a x)$
$\quad - (\Pr(c \mid \bar{a} b y) - \Pr(c \mid \bar{a} \bar{b} y)) \cdot \Pr(b \mid \bar{a} x)$ (3)
$\quad + \Pr(c \mid a \bar{b} y) - \Pr(c \mid \bar{a} \bar{b} y).$

Similar equations are found if arc directions are changed, as long as the fragment 
remains acyclic and B has at least one outgoing arc. 
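
Equation (3) can be checked numerically. The sketch below fixes one combination
of values x and y, marginalises B out of both conditional probabilities of c,
and compares the result with the three terms of Equation (3). It assumes the
independences of the right fragment of Fig. 4 (B depends only on A and X; C
depends only on A, B and Y); all probability values are random and purely
illustrative.

```python
from fractions import Fraction
import random


def both_sides(pb_a, pb_not_a, pc):
    """Return (lhs, rhs) of Equation (3) for fixed x and y.

    pb_a = Pr(b | a x), pb_not_a = Pr(b | not-a x), and
    pc[(i, j)] = Pr(c | A=i, B=j, y), with 1 for true and 0 for false.
    """
    # Left-hand side: marginalise B out of Pr(c | a x y) and Pr(c | not-a x y).
    pc_a = pc[1, 1] * pb_a + pc[1, 0] * (1 - pb_a)
    pc_not_a = pc[0, 1] * pb_not_a + pc[0, 0] * (1 - pb_not_a)
    lhs = pc_a - pc_not_a
    # Right-hand side: the three terms of Equation (3).
    rhs = ((pc[1, 1] - pc[1, 0]) * pb_a
           - (pc[0, 1] - pc[0, 0]) * pb_not_a
           + pc[1, 0] - pc[0, 0])
    return lhs, rhs


rng = random.Random(0)


def random_probability():
    return Fraction(rng.randint(0, 100), 100)


for _ in range(1000):
    pc = {(i, j): random_probability() for i in (0, 1) for j in (0, 1)}
    lhs, rhs = both_sides(random_probability(), random_probability(), pc)
    assert lhs == rhs
```

Exact fractions are used so that the two sides can be compared for equality
rather than up to a rounding tolerance.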

Parallel composition of influences may result in a net influence of larger 
magnitude: the result of adding two positive or two negative influences is at 
least as strong as the strongest of the influences added. This observation is also 
apparent from Fig. 5: the minimum of kappa values represents a stronger net 
influence. On the other hand, adding conflicting influences may result in a net 
influence of smaller magnitude, which is also apparent from Fig. 5. 




Fig. 4. Two illustrative network fragments

As examples, we now illustrate the parallel composition of two positive
influences, and, more interesting in the light of resolving trade-offs, the
composition of a positive and a negative influence. Similar observations with
respect to the sign and strength factor of a net influence apply to situations
in which the signs of the influences are different from those discussed.

Proposition 2. Let A and C be as in the right fragment of Fig. 4, then

$S^{+[p,q]}(A, C) \oplus S^{+[r,s]}(A, C) \Rightarrow S^{+[\min\{p,r\},\, \min\{q,s\}]}(A, C).$

Proof: Let B, X and Y be as in the right fragment of Fig. 4. Suppose that
$S^{+[p,q]}(A, C)$, and that the positive influence $S^{+[r,s]}(A, C)$ is
composed of the influences $S^{+[r',s']}(A, B)$ and $S^{+[r'',s'']}(B, C)$ such
that $r = r' + r'' + 1$ and $s = s' + s''$. Similar observations apply when
these latter two influences are negative. We now have for the
network-associated base $\varepsilon$ that

$\varepsilon^{p+1} < \Pr(c \mid a b_i y) - \Pr(c \mid \bar{a} b_i y) \le \varepsilon^{q}$, for all values $b_i$ of B,
$\varepsilon^{r'+1} < \Pr(b \mid a x) - \Pr(b \mid \bar{a} x) \le \varepsilon^{s'}$, and
$\varepsilon^{r''+1} < \Pr(c \mid a_i b y) - \Pr(c \mid a_i \bar{b} y) \le \varepsilon^{s''}$, for all values $a_i$ of A.

From Equation (3) for the net influence of node A on node C, we now find that

$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) > \varepsilon^{r'+r''+2} + \varepsilon^{p+1} = \varepsilon^{r+1} + \varepsilon^{p+1} > \varepsilon^{\min\{r,p\}+1},$
$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) \le \varepsilon^{q} + \varepsilon^{s'} \cdot \varepsilon^{s''} = \varepsilon^{q} + \varepsilon^{s},$

for any combination of values $xy$ for the set $X \cup Y$. The lower-bound for
this difference is, for example, attained for $\Pr(b \mid \bar{a} x) = 0$ and
$\Pr(b \mid a x) = \varepsilon^{r'+1}$. The upper-bound is attained, for
example, for $\Pr(b \mid a x) = 1$ and
$\Pr(b \mid \bar{a} x) = 1 - \varepsilon^{s'}$. In computing these bounds, we
have exploited the available information with regard to the signs and strengths
of the influences involved.

For infinitesimal $\varepsilon$ the upper-bound
$\varepsilon^{q} + \varepsilon^{s}$ is approximated by
$\varepsilon^{\min\{q,s\}}$; the net influence is thus positive with strength
factor $[\min\{p, r\}, \min\{q, s\}]$. □

If two influences have conflicting signs, then one ‘outweighs’ the other if its
weakest effect is stronger than the other influence's strongest effect. We adopt
the safest and most conservative approach to combining conflicting influences,
that is, comparing the lower bound of the one influence with the upper bound of
the other. Other interval comparison methods are however possible (see e.g. [9]).

$[u, v] = [\min\{p, r\}, \min\{q, s\}]$

a) $+[p, q]$, if $p + 1 < s$;
   $+[\infty, q]$, if $p < s$;
   $-[r, s]$, if $r + 1 < q$;
   $-[\infty, s]$, if $r < q$;
   ?, otherwise

b) see a) with $+$ and $-$ reversed

Fig. 5. The new $\oplus$-operator for combining signs and strength factors ($\varepsilon$ infinitesimal)

Proposition 3. Let A and C be as in the right fragment of Fig. 4, then

$S^{+[p,q]}(A, C) \oplus S^{-[r,s]}(A, C) \Rightarrow$
  $S^{+[p,q]}(A, C)$ if $p + 1 < s$;
  $S^{+[\infty,q]}(A, C)$ if $p < s$;
  $S^{-[r,s]}(A, C)$ if $r + 1 < q$;
  $S^{-[\infty,s]}(A, C)$ if $r < q$;
  $S^{?}(A, C)$ otherwise.

Proof: Let B, X and Y be as in the right fragment of Fig. 4. Suppose
$S^{+[p,q]}(A, C)$, and let the negative influence $S^{-[r,s]}(A, C)$ be
composed of $S^{-[r',s']}(A, B)$ and $S^{+[r'',s'']}(B, C)$ such that
$r = r' + r'' + 1$ and $s = s' + s''$. Similar observations apply when these
latter signs are switched. From Equation (3), we now have for the
network-associated base $\varepsilon$ that

$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) > \varepsilon^{p+1} - \varepsilon^{s'} \cdot \varepsilon^{s''} = \varepsilon^{p+1} - \varepsilon^{s}$, and
$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) < \varepsilon^{q} - \varepsilon^{r'+1} \cdot \varepsilon^{r''+1} = \varepsilon^{q} - \varepsilon^{r+1}$

for any combination of values $xy$ for $X \cup Y$. The lower-bound for the
difference (not distance!) is attained, for example, for
$\Pr(b \mid a x) = 0$ and $\Pr(b \mid \bar{a} x) = \varepsilon^{s'}$; the
upper-bound for the difference is attained, for example, for
$\Pr(b \mid \bar{a} x) = 1$, which enforces
$\Pr(b \mid a x) < 1 - \varepsilon^{r'+1}$. In computing these bounds, we have
once again exploited the available information with regard to the signs and
strengths of the influences involved.

Now, if $\varepsilon^{p+1} \ge \varepsilon^{s}$ then
$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) > 0 = \varepsilon^{\infty}$. Given
infinitesimal $\varepsilon$ the lower-bound
$\varepsilon^{p+1} - \varepsilon^{s}$ is approximated by $\varepsilon^{p+1}$
under the tighter constraint $p + 1 < s$. The constraint $p + 1 < s$ also
implies $q < r + 1$, giving an upper-bound of
$\varepsilon^{q} - \varepsilon^{r+1} < \varepsilon^{q}$. We conclude that the
resulting influence is positive with strength factor $[p, q]$ if $p + 1 < s$
and strength factor $[\infty, q]$ if $p < s$.

On the other hand, if $\varepsilon^{r+1} \ge \varepsilon^{q}$ then
$\Pr(c \mid a x y) - \Pr(c \mid \bar{a} x y) < 0$. Given infinitesimal
$\varepsilon$ the (negative!) upper-bound
$-\varepsilon^{r+1} + \varepsilon^{q}$ is approximated by $-\varepsilon^{r+1}$
under the tighter constraint $r + 1 < q$. The constraint $r + 1 < q$ also
implies $p + 1 > s$, so we find a (negative) lower-bound of
$-\varepsilon^{s} + \varepsilon^{p+1} > -\varepsilon^{s}$. We conclude that the
resulting influence is negative. Taking the absolute values of the given
bounds, we find a strength factor $[r, s]$ if $r + 1 < q$ and $[\infty, s]$ if
$r < q$. □

The Non-infinitesimal Case. The $\oplus$-operator defined above explicitly uses
the fact that our kappa values are order-of-magnitude approximations of
differences in probability by just taking into account the most significant
$\varepsilon$-term in determining the strength factor of the net influence.
Such approximations are valid as long as $\varepsilon$ indeed adheres to the
assumption that it is infinitesimal. In a realistic problem domain, however,
probabilities and even differences in probability are hardly ever all very
close to zero or one, and a non-infinitesimal $\varepsilon$ is required to
distinguish different levels of strength.

Although the inference algorithm sums only two signs with strength factors
at a time, ultimately a sign and strength factor can be the result of a larger
summation. If $1/\varepsilon$ parallel chains to a single node are combined upon
inference, the approximation used by the $\oplus$-operator will be an order of
magnitude off, affecting not only the strength factor of the net influence (the
interval becomes too ‘tight’: the influence can be stronger or weaker than
captured by the interval), but possibly its sign as well. For inference in a
kappa-enhanced network in which the assumption of an infinitesimal
$\varepsilon$ is violated, therefore, we have to perform an additional
operation. We have a choice between two types of operation, depending on
whether or not the actual value of $\varepsilon$ is known. If $\varepsilon$ is
unknown, this operation consists of ‘broadening’ the interval an extra order of
magnitude upon each sign addition: when composing two influences with the same
sign, the occurrences of $\min\{q, s\}$ in Fig. 5 should be replaced by
$\max\{0, \min\{q, s\} - 1\}$ to obtain a true upper-bound, assuming that
$\varepsilon < 0.5$. Under this same assumption, when adding a positive and a
negative influence, we find true lower-bounds by replacing in Fig. 5 each p and
r in a) and b) by $p + 1$ and $r + 1$, respectively. If the actual value of
$\varepsilon$ is known, the additional operation consists of performing the
discussed correction only when necessary, that is, if a sign is composed (a
multiple of) $1/\varepsilon$ times. For this option, each sign needs to record
how often it is summed during sign-propagation.

The adaptation of parallel composition for non-infinitesimal $\varepsilon$
leads to weaker, but at least correct, results. In correcting the upper- and
lower-bounds of the strength factor, we have assumed that $\varepsilon < 0.5$.
This assumption seems reasonable, as each probability distribution has at most
one probability larger than 0.5, and differences between probabilities are
therefore likely to be less than 0.5.
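
The parallel composition of Fig. 5, together with the broadening for an
unknown, non-infinitesimal base, can be sketched as below. Two readings are
assumed here and are not stated in this form in the text: the cases under a)
are tried in the order listed, and for a non-infinitesimal base the
replacement of p and r by p + 1 and r + 1 is applied to the resulting interval
only, with the conditions left unchanged; the latter reading is the one that
reproduces the result of Example 2. Zero and ambiguous operands are not
handled, and the name oplus and the tuple representation are illustrative.

```python
import math


def oplus(first, second, infinitesimal=True):
    """Parallel composition of two signed strength factors (sign, p, q)."""
    sign_a, p, q = first
    sign_b, r, s = second
    if sign_a not in '+-' or sign_b not in '+-':
        raise ValueError("only '+' and '-' are handled in this sketch")
    if sign_a == sign_b:
        upper = min(q, s)
        if not infinitesimal:
            # Broaden by one order of magnitude, but never below kappa 0.
            upper = max(0, upper - 1)
        return (sign_a, min(p, r), upper)
    widen = 0 if infinitesimal else 1
    # Conflicting signs: compare one influence's weakest effect with the
    # other's strongest effect, trying the cases in the order of Fig. 5.
    if p + 1 < s:
        return (sign_a, p + widen, q)
    if p < s:
        return (sign_a, math.inf, q)
    if r + 1 < q:
        return (sign_b, r + widen, s)
    if r < q:
        return (sign_b, math.inf, s)
    return ('?', None, None)


# Net influence of A on D in Example 2 (Sect. 4.3).
print(oplus(('-', 4, 0), ('+', 10, 6)))
print(oplus(('-', 4, 0), ('+', 10, 6), infinitesimal=False))
```

The case distinction is the conservative one of the text; an ambiguous result
is returned whenever neither influence is guaranteed to outweigh the other.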

4.3 Applying the Inference Algorithm 

The properties of transitivity and parallel composition of influences can, as
argued, be applied in a kappa-enhanced network. The property of symmetry holds
for qualitative influences with respect to their sign, but not with respect to
their strength. For an influence against the direction of an arc, we must
therefore either use the default interval $[\infty, 0]$, or an explicitly
specified strength factor.

Using the new $\otimes$- and $\oplus$-operators, the sign-propagation algorithm
for regular qualitative probabilistic networks can now be applied to
kappa-enhanced networks. Node signs are again initialised to ‘0’; observations
are once again entered as a ‘$+$’ or ‘$-$’. The strength factor associated with
an observation is either a dummy interval $[-1, 0]$ (so as to cause no loss of
information upon the first operations), or an actual interval of kappa values
to capture the strength of the observation. We illustrate the application of
the algorithm.

Example 2. Consider the network from Fig. 6, with strength factors provided by
domain experts. We again observe that a patient has taken antibiotics and enter
this observation as ‘$+[-1, 0]$’ for node A. Node A propagates this ‘sign’ to T,
which receives ‘$+[-1, 0] \otimes -[1, 0] = -[1, 0]$’ and sends this to D. Node
D in turn receives ‘$-[1, 0] \otimes +[2, 0] = -[4, 0]$’ and does not pass on
any sign. Node A also sends its sign to F, which receives
‘$+[-1, 0] \otimes +[4, 3] = +[4, 3]$’ and passes this on to D. Node D receives
the additional sign ‘$+[4, 3] \otimes +[5, 3] = +[10, 6]$’. The net influence of
node A on D therefore is ‘$-[4, 0] \oplus +[10, 6]$’, which equals ‘$-[4, 0]$’
if $\varepsilon$ is infinitesimal, and ‘$-[5, 0]$’ otherwise. Taking antibiotics
thus decreases the chance of suffering from diarrhoea. Note that we are now
able to resolve the represented trade-off. □

Inference in a kappa-enhanced network may become less efficient than in a regu- 
lar qualitative network, because strength factors change more often than signs. In 
theory, a strength factor could change upon each sign-addition enforcing prop- 
agation to take time polynomial in the number of chains to a single node in 
the digraph. Kappa-enhanced networks, however, allow for resolving trade-offs 
which qualitative networks do not. A polynomial bound on inference in kappa- 
enhanced networks can be ensured by limiting the number of sign-additions 
performed and reverting to the use of default intervals once this limit is reached. 
The use of default intervals may again lead to weaker results, but never to in- 
correct ones. Another option is to isolate the area in which trade-offs occur, use 
kappa-enhanced inference in that area and regular qualitative inference in the 
remaining network [10]. 



5 Conclusions and Further Research 

A drawback of qualitative probabilistic networks is their coarse level of detail. 
Although sufficient for some problem domains, this coarseness may lead to unre- 
solved trade-offs during inference in other domains. In this paper, we combined 
and extended qualitative probabilistic networks and kappa values. We introduced 
the use of kappa values to provide for levels of strength within the qualitative 
probabilistic network framework, thereby allowing for trade-off resolution. The 
kappa-enhanced networks are very suitable for domains in which all differences 
in probability are close to zero. Previous research has shown that Kappa
networks can give good results even for non-infinitesimal $\varepsilon$ [5,11].
For our purpose, however, we feel that the little information we are depending
on had better be reliable. If $\varepsilon$ is not infinitesimal, a minor
adaptation upon sign-addition ensures that inference still leads to correct,
though possibly weaker, results.

Fig. 6. The kappa-enhanced Antibiotics network.

This paper presents a possible way of combining qualitative probabilistic 
networks with elements from the kappa calculus. Other combinations may of 
course be possible. We adapted the basic sign-propagation algorithm for regular 
qualitative probabilistic networks, with new operators for propagating signs and 
strength factors in kappa-enhanced networks; the resulting algorithm may, how- 
ever, become less efficient. We already mentioned two possible solutions to this 
problem. Another possibility may be to exploit more elements from the kappa 
calculus: although NP-hard in general, under certain conditions, reasoning with 
kappa values can be tractable [12]; further research should indicate if strength 
factors may and can be propagated more efficiently using combination rules from 
the kappa calculus.


Qualitative Bayesian Networks with Logical Constraints

Abstract. An important feature of Qualitative Bayesian Networks is
that they describe conditional independence models. However, they are
not able to handle models involving logical constraints among the given
variables. The aim of this paper is to show how this theory can be
extended so as to represent the logical constraints in the graph as well,
through an enhanced version of Qualitative Bayesian Networks. The
related algorithm for building these graphs (which is a generalization of
the well-known algorithm based on the D-separation criterion) is given.

This theory is particularly fit for conditional probabilistic independence
models based on the notion of cs-independence. This notion avoids the
usual critical situations shown by the classic definition when logical
constraints are present.



1 Introduction 

The important feature of Qualitative Bayesian Networks, and more generally
of graphical models, is that they describe conditional independence models
induced by a given probability. However, the graphical representation is apt to
describe only the conditional independence statements and not the (possible)
logical constraints among the variables. Actually, a first attempt in this
direction was made by Pearl in [1], proposing an enhanced version of the
d-separation criterion (called D-separation), which also encodes the functional
dependencies among the variables of the domain. But functional dependencies are
actually only a particular case of logical constraints (see Section 2), which
cannot be ignored; in fact, they have a crucial role from a practical point of
view (see e.g. [2], [3]) and from a theoretical one, e.g. in the independence
notion and in the inference processes [4], [5], [6].

In this paper we present an enhanced version of Qualitative Bayesian Net- 
works, which is able to describe both conditional independence statements and 
logical constraints among the variables. 

We show that this theory is particularly fit for the independence models
induced by conditional probabilities, which cover more general situations and
avoid the well-known critical situations related to logical dependence
relations and probability values 0 and 1 on possible events (i.e. events
different from the impossible and the certain event, respectively).



T.D. Nielsen and N.L. Zhang (Eds.): ECSQARU 2003, LNAI 2711, pp. 100-112, 2003. 
© Springer-Verlag Berlin Heidelberg 2003




Qualitative Bayesian Networks with Logical Constraints 101 



These models do not necessarily satisfy the symmetry property [9], and the
representation problem of asymmetric structures has been studied in [7] and [8].

In this paper we focus on a subclass of these conditional independence models
(those closed under the symmetry property), which obey graphoid properties (see
[9]); in Section 3 some further relevant results concerning conditional
independence are proved. These results are used extensively in the next section.

In Section 4 we deal with the representation problem of these independence
models, showing how to describe both conditional independence statements and
logical constraints. As in the classic independence models, in our setting too
there are some models that are not completely representable by a graph (even
if the models obey graphoid properties, see Example 1). So, for a given
independence model $\mathcal{M}$, the notion of minimal I-map (i.e. a directed
acyclic graph such that every statement represented by it is also in
$\mathcal{M}$, while the graph obtained by removing any arrow from it would
represent a statement not in $\mathcal{M}$) can be redefined, taking into
account logical constraints, along the same lines as the classic case (see,
e.g. [1], [10]).

Moreover, we generalize the procedure for building such minimal I-maps, proving
that any ordering on the variables gives rise to a minimal I-map for
$\mathcal{M}$.



2 Logical Constraints 

A possible event denotes an event different from $\emptyset$ and $\Omega$. Two
distinct non-trivial partitions $\mathcal{E}_1$ and $\mathcal{E}_2$ of $\Omega$
are logically independent if the “finer” partition $\mathcal{E}$ (also called
the set of atoms) generated by them coincides with the set of all possible
conjunctions between the events of $\mathcal{E}_1$ and $\mathcal{E}_2$, i.e.

$\mathcal{E} = \mathcal{E}_1 \times \mathcal{E}_2 = \{C = C_1 \wedge C_2 : C_1 \in \mathcal{E}_1;\ C_2 \in \mathcal{E}_2\}.$

Note that for simplicity we denote any atom of $\mathcal{E}_i$ by $C_i$
($i = 1, 2$). So, in such a case the cardinality $|\mathcal{E}|$ of
$\mathcal{E}$ is equal to $|\mathcal{E}_1| \cdot |\mathcal{E}_2|$. In
particular, the events A and B are logically independent if the partitions
$\mathcal{E}_1 = \{A, A^c\}$ and $\mathcal{E}_2 = \{B, B^c\}$ are logically
independent, so the set of atoms is
$\{A \wedge B, A \wedge B^c, A^c \wedge B, A^c \wedge B^c\}$.

A logical constraint exists between two partitions if they are not logically
independent, i.e. some conjunction of the kind $C_1 \wedge C_2$ is not possible
(they are also said to be semi-dependent in [11]).

Analogously, the partitions $\mathcal{E}_1, \ldots, \mathcal{E}_n$ are logically
independent if $C_1 \wedge \ldots \wedge C_n \neq \emptyset$ for any
$C_i \in \mathcal{E}_i$ (with $i = 1, \ldots, n$). Obviously, if n partitions
are logically independent, then any k partitions ($1 < k < n$) are logically
independent too. On the other hand, n partitions
$\mathcal{E}_1, \ldots, \mathcal{E}_n$ are not necessarily logically
independent, even if every choice of $n - 1$ partitions
$\mathcal{E}_{i_1}, \ldots, \mathcal{E}_{i_{n-1}}$ is logically independent; in
that case there is some logical constraint of the kind
$C_1 \wedge \ldots \wedge C_n = \emptyset$, with $C_i \in \mathcal{E}_i$ (while
for any $i_1, \ldots, i_{n-1}$ one has
$C_{i_1} \wedge \ldots \wedge C_{i_{n-1}} \neq \emptyset$).

For example, consider three distinct partitions $\mathcal{E}_1 = \{A, A^c\}$,
$\mathcal{E}_2 = \{B, B^c\}$ and $\mathcal{E}_3 = \{C, C^c\}$ of $\Omega$ such
that $A \wedge B \wedge C = \emptyset$. All the pairs of these partitions are
logically independent, but the partition $\mathcal{E}_1$, for example, is not
logically independent of the partition generated by
$\{\mathcal{E}_2, \mathcal{E}_3\}$.

Given n partitions and a logical constraint among them, it is possible to find
the minimal subset $\{\mathcal{E}_1, \ldots, \mathcal{E}_k\}$ of partitions
generating such a constraint: there exists at least one combination of atoms,
with $C_i \in \mathcal{E}_i$, such that
$C_1 \wedge \ldots \wedge C_k = \emptyset$, while
$C_1 \wedge \ldots \wedge C_{j-1} \wedge C_{j+1} \wedge \ldots \wedge C_k \neq \emptyset$
for all $j = 1, \ldots, k$. Note that this does not imply that all the subsets
of the k partitions are logically independent, because there could be another
different logical constraint involving some subset.

We say that such a set of partitions $\{\mathcal{E}_1, \ldots, \mathcal{E}_k\}$
is the minimal set generating the given logical constraint.
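
The three-partition example above can be checked mechanically. In the sketch
below, $\Omega$ is modelled as a finite set of truth assignments to (A, B, C)
from which the assignment making all three true has been removed, and events
are sets of assignments; this encoding and the function names are assumptions
of the example, not part of the theory.

```python
from itertools import product


def logically_independent(*partitions):
    """True iff every conjunction of one atom per partition is possible."""
    return all(frozenset.intersection(*atoms)
               for atoms in product(*partitions))


def generated_partition(*partitions):
    """Atoms of the finer partition generated by the given partitions."""
    atoms = (frozenset.intersection(*combo) for combo in product(*partitions))
    return [atom for atom in atoms if atom]


# Omega: all truth assignments to (A, B, C) except the one where all hold,
# which encodes the logical constraint "A and B and C is impossible".
omega = frozenset(w for w in product([True, False], repeat=3)
                  if w != (True, True, True))


def binary_partition(index):
    event = frozenset(w for w in omega if w[index])
    return [event, omega - event]


E1, E2, E3 = (binary_partition(i) for i in range(3))

print(logically_independent(E1, E2))      # each pair is independent
print(logically_independent(E1, E2, E3))  # the triple is not
print(logically_independent(E1, generated_partition(E2, E3)))
```

Here $\{\mathcal{E}_1, \mathcal{E}_2, \mathcal{E}_3\}$ is also the minimal set
generating the constraint, since dropping any one partition leaves a logically
independent pair.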

Remark 1. The fact that $\{\mathcal{E}_1, \ldots, \mathcal{E}_k\}$ is the
minimal set for a given logical constraint does not imply that a partition,
e.g. $\mathcal{E}_k$, is logically dependent on the others. In fact,
$\mathcal{E}_k$ is logically dependent on the other partitions (or better on
the set of atoms generated by the other partitions) if for any
$C_k \in \mathcal{E}_k$ and any atom C generated by
$\mathcal{E}_1, \ldots, \mathcal{E}_{k-1}$, one has either $C \subseteq C_k$ or
$C \wedge C_k = \emptyset$. Thus, the partition generated by
$\mathcal{E}_1, \ldots, \mathcal{E}_{k-1}$ refines $\mathcal{E}_k$. Therefore,
the presence of some logical constraints is a situation more general than
logical dependence.

Analogously we say that a vector of random variables $(X_1, \ldots, X_n)$ is
linked by a logical constraint, if there is a logical constraint among the
partitions $\mathcal{E}_1, \ldots, \mathcal{E}_n$ generated, respectively, by
$X_1, \ldots, X_n$. Moreover, for a given logical constraint we say that the
sub-vector $(X_1, \ldots, X_k)$ is the minimal set for it, if
$\{\mathcal{E}_1, \ldots, \mathcal{E}_k\}$ is the minimal set generating it.

On the other hand, $X_n$ is functionally dependent on $(X_1, \ldots, X_{n-1})$
if its value is completely determined by the values assumed by the variables
$X_1, \ldots, X_{n-1}$, i.e. the partition $\mathcal{E}_n$ is logically
dependent on $\mathcal{E}_1, \ldots, \mathcal{E}_{n-1}$.

3 Conditional Independence Models 

Conditional independence models have been developed for a variety of different
uncertainty formalisms (see e.g. [12]); here we recall the one based on
conditional probability as defined by de Finetti [11], Krauss [13], Dubins [14].

Definition 1. Given a finite Boolean algebra $\mathcal{B}$ and an additive set
$\mathcal{H}$ (i.e. closed w.r.t. finite disjunctions) such that
$\emptyset \notin \mathcal{H}$, a conditional probability on
$\mathcal{B} \times \mathcal{H}$ is a function $P(\cdot|\cdot)$ into [0, 1],
which satisfies the following conditions:

(i) $P(\cdot|H)$ is a finitely additive probability on $\mathcal{B}$ for any $H \in \mathcal{H}$;

(ii) $P(H|H) = 1$ for every $H \in \mathcal{H}$;

(iii) $P(E \wedge A|H) = P(E|H)\,P(A|E \wedge H)$, whenever $E, A \in \mathcal{B}$ and $H, E \wedge H \in \mathcal{H}$.



Note that (iii) reduces, when $H = \Omega$, to the classic “chain rule” for probability
$P(E \wedge A) = P(E)\,P(A \mid E)$. In the case $P_0(\cdot) = P(\cdot \mid \Omega)$ is strictly positive on $\mathcal{B}^0$,
any conditional probability can be derived as a ratio (Kolmogorov’s definition)
by this unique “unconditional” probability $P_0$.

As proved in [6], in all other cases to get a similar representation we need to
resort to a finite family $\mathcal{P} = \{P_0, \ldots, P_k\}$ of unconditional probabilities: every
$P_\alpha$ is defined on a proper set (taking $\mathcal{A}_0 = \mathcal{B}$)

$\mathcal{A}_\alpha = \{E \in \mathcal{A}_{\alpha-1} : P_{\alpha-1}(E) = 0\}$;

and for each event $B \in \mathcal{B}^0$ (with $\mathcal{B}^0 = \mathcal{B} \setminus \{\emptyset\}$) there exists a unique $\alpha$ such that
$P_\alpha(B) > 0$ and for every conditional event $A \mid B$ one has

$P(A \mid B) = \dfrac{P_\alpha(A \wedge B)}{P_\alpha(B)}$. (1)

The class $\mathcal{P} = \{P_0, \ldots, P_k\}$ of probabilities, with $P_\alpha$ defined on $\mathcal{A}_\alpha$ (for $\alpha =
0, \ldots, k$), is said to agree with the conditional probability $P(\cdot \mid \cdot)$. This class is
unique if the conditional probability is defined on $\mathcal{B} \times \mathcal{B}^0$.

Let us recall one of the main features of this approach: the possibility of assessing
an evaluation P directly on an arbitrary set of conditional events:

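Definition 2. The assessment P on an arbitrary family $\mathcal{F} =
\{E_1 \mid H_1, \ldots, E_n \mid H_n\}$ of conditional events is coherent if there exists $\mathcal{K} \supseteq \mathcal{F}$, with
$\mathcal{K} = \mathcal{B} \times \mathcal{H}$ and $\mathcal{H} \subset \mathcal{B}$ ($\mathcal{B}$ a Boolean algebra, $\mathcal{H}$ an additive set), such that P
can be extended from $\mathcal{F}$ to $\mathcal{K}$ as a conditional probability.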

For the checking of coherence and to find the agreeing class we refer to [6] . 

Definition 3. Let P be a coherent conditional probability on $\mathcal{F}$ and consider an
agreeing class $\mathcal{P} = \{P_0, \ldots, P_k\}$. For any event E belonging to $\mathcal{B}^0$ we call zero-
layer of E, with respect to $\mathcal{P}$, the (non-negative) number $\alpha$ such that $P_\alpha(E) > 0$
(in symbols $o(E) = \alpha$). Moreover, we define $o(\emptyset) = +\infty$.

For any $E \mid H \in \mathcal{B} \times \mathcal{B}^0$ we define instead $o(E \mid H) = o(E \wedge H) - o(H)$.

Note that the zero-layers depend on the chosen agreeing class $\mathcal{P}$.
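
The role of the agreeing class can be illustrated with a small sketch. It assumes (this representation is not in the paper) that events are sets of atoms and that each layer $P_\alpha$ is a dictionary from atoms to probabilities; the function names `zero_layer`, `cond_zero_layer` and `cond_prob` are hypothetical. The conditional probability is recovered as a ratio inside the first layer that gives the conditioning event positive probability.

```python
from math import inf

def zero_layer(event, layers):
    """Return the index of the first layer giving `event` positive probability.

    `event` is a set of atoms; `layers` is a list of dicts atom -> probability
    (an illustrative encoding of an agreeing class P_0, ..., P_k).
    """
    if not event:
        return inf  # by convention the impossible event has zero-layer +infinity
    for alpha, layer in enumerate(layers):
        if sum(layer.get(atom, 0.0) for atom in event) > 0:
            return alpha
    raise ValueError("event has zero probability in every layer")

def cond_zero_layer(e, h, layers):
    """Zero-layer of the conditional event E|H: o(E and H) - o(H)."""
    return zero_layer(e & h, layers) - zero_layer(h, layers)

def cond_prob(a, b, layers):
    """P(A|B) as a ratio computed in the zero-layer of B."""
    layer = layers[zero_layer(b, layers)]
    p_b = sum(layer.get(atom, 0.0) for atom in b)
    p_ab = sum(layer.get(atom, 0.0) for atom in a & b)
    return p_ab / p_b

# Three atoms; atoms 2 and 3 have probability 0 in the first layer,
# so they are handled by the second layer.
layers = [{1: 1.0}, {2: 0.5, 3: 0.5}]
print(cond_prob(frozenset({2}), frozenset({2, 3}), layers))
```

Here conditioning on the event {2, 3}, which has probability zero in the first layer, is still meaningful because the second layer is used.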

In this framework the following definition of independence has been given in [4] : 

Definition 4. Given a coherent conditional probability P, defined on a family
$\mathcal{F}$ containing $\mathcal{D} = \{A^* \mid B^* \wedge C,\ B^* \mid A^* \wedge C\}$ (where $A^*$, and analogously $B^*$, stands
for either $A$ or its contrary $A^c$), A is conditionally cs-independent of B given C
with respect to P if both the following conditions hold:

(i) $P(A \mid B \wedge C) = P(A \mid B^c \wedge C)$;

(ii) there exists a class $\{P_\alpha\}$ of probabilities agreeing with the restriction of
P to the family $\mathcal{D}$, such that

$o(A \mid B \wedge C) = o(A \mid B^c \wedge C)$ and $o(A^c \mid B \wedge C) = o(A^c \mid B^c \wedge C)$.

Note that if $0 < P(A \mid B \wedge C) = P(A \mid B^c \wedge C) < 1$ (so $0 < P(A^c \mid B \wedge C) =
P(A^c \mid B^c \wedge C) < 1$), then both equalities in condition (ii) are trivially satisfied:

$o(A \mid B \wedge C) = 0 = o(A \mid B^c \wedge C)$ and $o(A^c \mid B \wedge C) = 0 = o(A^c \mid B^c \wedge C)$.

Hence, in this case condition (i) completely characterizes conditional cs-independence
and, in addition, B is also cs-independent of A given C, so this definition
coincides with the classic formulations when $P(B \mid C)$ and $P(C)$ are also in (0, 1).
However, in the other cases (when $P(A \mid B \wedge C)$ is 0 or 1) condition (i) needs
to be “reinforced” by the requirement that the corresponding zero-layers also be equal,
otherwise we can meet critical situations (see, e.g. [6]).

Remark 2. Even if different classes agreeing with $P_{|\mathcal{D}}$ may give rise to different
zero-layers, it has nevertheless been proved in [6] that condition (ii) of Definition
4 either holds for all the agreeing classes of P or for none of them.

Notice that for every event A the notion of cs-independence is always irreflexive
(also when the probability of A is 0 or 1) because $o(A \mid A) = 0$, while $o(A \mid A^c) =
\infty$. Actually, the following result holds (see [4], [9]):

Theorem 1. Let A and B be two possible events such that $A \wedge C \neq \emptyset$ and $A \not\supseteq C$.
If $A \perp\!\!\!\perp_{cs} B \mid C$ under a conditional probability P, then A and B are logically
independent with respect to C (i.e. all the events $A^* \wedge B^* \wedge C$ are possible).

Note that

$A \perp\!\!\!\perp_{cs} B \mid C \;\Rightarrow\; P(A \mid B \wedge C) = P(A \mid C)$, (2)

that is, cs-independence implies the classic independence notion, but the converse
implication does not hold (as shown above; for more details see [4]).

In [4] a theorem is also given that characterizes cs-independence of two logically
independent events A and B in terms of the probabilities $P(B \mid C)$, $P(B \mid A \wedge C)$ and
$P(B \mid A^c \wedge C)$, giving up any direct reference to the zero-layers.

Theorem 2. Let A, B be two events logically independent with respect to the
event C. If P is a coherent conditional probability such that $P(A \mid B \wedge C) =
P(A \mid B^c \wedge C)$, then $A \perp\!\!\!\perp_{cs} B \mid C$ if and only if one (and only one) of the following
conditions holds:

(a) $0 < P(A \mid B \wedge C) < 1$;

(b) $P(A \mid B \wedge C) = 0$ and the extension of P to $B \mid C$ and $B \mid A \wedge C$ satisfies one
of the following conditions

1. $P(B \mid C) = 0$, $P(B \mid A \wedge C) = 0$,

2. $P(B \mid C) = 1$, $P(B \mid A \wedge C) = 1$,

3. $0 < P(B \mid C) < 1$, $0 < P(B \mid A \wedge C) < 1$;

(c) $P(A \mid B \wedge C) = 1$ and the extension of P to $B \mid C$ and $B \mid A^c \wedge C$ satisfies one
of the following conditions

1. $P(B \mid C) = 0$, $P(B \mid A^c \wedge C) = 0$,

2. $P(B \mid C) = 1$, $P(B \mid A^c \wedge C) = 1$,

3. $0 < P(B \mid C) < 1$, $0 < P(B \mid A^c \wedge C) < 1$.

The probability assessments satisfying the symmetric property (when it makes
sense) are characterized by the next result (proved in [9]).

Proposition 1. Let A, B be two events logically independent with respect to C.
If P is a coherent probability, then $A \perp\!\!\!\perp_{cs} B \mid C$ [P] and $B \perp\!\!\!\perp_{cs} A \mid C$ [P] iff

$P(A \mid B \wedge C) = P(A \mid B^c \wedge C)$ and $P(B \mid A \wedge C) = P(B \mid A^c \wedge C)$. (3)




Qualitative Bayesian Networks with Logical Constraints 105 



Remark 3. The validity of $A \perp\!\!\!\perp_{cs} B \mid C$ and its symmetric statement under P is
again more general than that described in the classic Kolmogorovian context: in
fact the two conditions coincide only if $P(A \mid C)$ and $P(B \mid C)$ are in (0, 1), while
the conditional probabilities $P(B \mid A \wedge C)$ and $P(A \mid B \wedge C)$ are not defined when
$P(A \mid C)$ or $P(B \mid C)$ is 0. In conclusion, Proposition 1 points out that, when
symmetry is required, only condition (3) must be checked and the different cases considered
by Theorem 2 are absorbed.

Indeed, in the quoted paper the definition of cs-independence has been extended 
to the case of finite sets of events and to finite random variables. 



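Definition 5. Let $\mathcal{E}_1, \mathcal{E}_2, \mathcal{E}_3$ be three different finite partitions such that $\mathcal{E}_2$ is
not trivial. The partition $\mathcal{E}_1$ is independent of $\mathcal{E}_2$ given $\mathcal{E}_3$ under the coherent
conditional probability P (in symbols $\mathcal{E}_1 \perp\!\!\!\perp_{cs} \mathcal{E}_2 \mid \mathcal{E}_3$ [P]) iff $C_{i_1} \perp\!\!\!\perp_{cs} C_{i_2} \mid C_{i_3}$ [P] for
every $C_{i_1} \in \mathcal{E}_1$, $C_{i_2} \in \mathcal{E}_2$, $C_{i_3} \in \mathcal{E}_3$ such that $C_{i_2} \wedge C_{i_3} \neq \emptyset$ and $C_{i_2} \not\supseteq C_{i_3}$.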



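Let $X = (X_1, \ldots, X_n)$ be a finite random vector with values in $R_X \subseteq \mathbb{R}^n$. The
partition generated by X is denoted by $\mathcal{E}_X = \{X = x : x \in R_X\}$.

Definition 6. Let (X, Y, Z) be a finite discrete random vector with values in
$R \subseteq R_X \times R_Y \times R_Z$ and $\mathcal{E}_X$, $\mathcal{E}_Y$, $\mathcal{E}_Z$ the partitions generated, respectively, by
X, Y and Z. Let P be a conditional probability on $\mathcal{F} \supseteq \{A \mid B \wedge C : A \in \mathcal{E}_X, B \in
\mathcal{E}_Y, C \in \mathcal{E}_Z\}$: then X is cs-independent of Y given Z with respect to P (in symbols
$X \perp\!\!\!\perp_{cs} Y \mid Z$ [P]) iff $\mathcal{E}_X \perp\!\!\!\perp_{cs} \mathcal{E}_Y \mid \mathcal{E}_Z$ [P].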

The following result is a generalization of Proposition 1 . 

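Theorem 3. Let $\mathcal{E}_1, \mathcal{E}_2$ be two logically independent partitions. Given a conditional
probability P, the statement $\mathcal{E}_1 \perp\!\!\!\perp_{cs} \mathcal{E}_2$ and its symmetric hold under P iff for any
$A_i \in \mathcal{E}_1$ and $B_j \in \mathcal{E}_2$ one has

$P(A_i \mid B_j) = P(A_i)$ and $P(B_j \mid A_i) = P(B_j)$. (4)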



Proof: From implication (2) it follows that if $\mathcal{E}_1 \perp\!\!\!\perp_{cs} \mathcal{E}_2$, then for $A_i \in \mathcal{E}_1$ one has
$P(A_i \mid B_j) = P(A_i)$ for any $B_j \in \mathcal{E}_2$.

Conversely, if $P(A_i \mid B_j) = P(A_i)$ holds for any $B_j \in \mathcal{E}_2$, then one has from the
disintegration property and equation (1), being $o(B_j^c) = \alpha$ (with $\alpha$ equal to 0 if
$P(B_j^c) > 0$ and 1 when $P(B_j) = 1$), that

$$P(A_i \mid B_j^c) = \frac{\sum_{k \neq j} P_\alpha(A_i \wedge B_k)}{P_\alpha(B_j^c)} = \frac{\sum_{k \neq j} P(A_i \mid B_k)\, P_\alpha(B_k)}{P_\alpha(B_j^c)} = P(A_i).$$

Therefore, for any $A_i$ the first equality of condition (3) holds for any $B_j \in \mathcal{E}_2$.
Analogously, the second equality of condition (3) can be proved from the
second equality of condition (4). So, the conclusion follows from Proposition 1.

Theorem 3 can be generalized as follows (the proof goes along the same lines):



Theorem 4. Let (X, Y, Z) be a random vector such that any $A_i \in \mathcal{E}_X$, $B_j \in \mathcal{E}_Y$
are logically independent with respect to any $C_k \in \mathcal{E}_Z$ if $B_j \wedge C_k \notin \{\emptyset, C_k\}$. Given
a conditional probability P, the statement $X \perp\!\!\!\perp_{cs} Y \mid Z$ and its symmetric hold
under P iff for any $x_i \in R_X$, $y_j \in R_Y$ and $z_k \in R_Z$ one has

$P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$ (5)

$P(Y = y_j \mid X = x_i, Z = z_k) = P(Y = y_j \mid Z = z_k)$. (6)

Remark 4. Theorem 4 brings out the crucial role of logical constraints.
Actually, it can happen that equations (5) and (6) hold even if the
cs-independence statement is not valid because of a logical constraint involving
both X and Y.

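The set $\mathcal{M}_P$ of cs-independence statements induced by a coherent conditional
probability P of the form $X_I \perp\!\!\!\perp_{cs} X_J \mid X_K$, with I, J, K three disjoint subsets,
is called cs-independence model. Every cs-independence model induced by P is
closed with respect to the following properties (for the proof see [9]):

Decomposition property

$X_I \perp\!\!\!\perp_{cs} [X_J, X_K] \mid X_W$ [P] $\Rightarrow$ $X_I \perp\!\!\!\perp_{cs} X_J \mid X_W$ [P];

Reverse decomposition property

$[X_I, X_J] \perp\!\!\!\perp_{cs} X_W \mid X_K$ [P] $\Rightarrow$ $X_I \perp\!\!\!\perp_{cs} X_W \mid X_K$ [P];

Weak union property

$X_I \perp\!\!\!\perp_{cs} [X_J, X_K] \mid X_W$ [P] $\Rightarrow$ $X_I \perp\!\!\!\perp_{cs} X_J \mid [X_W, X_K]$ [P];

Contraction property

$X_I \perp\!\!\!\perp_{cs} X_W \mid [X_J, X_K]$ [P] and $X_I \perp\!\!\!\perp_{cs} X_J \mid X_K$ [P] $\Rightarrow$ $X_I \perp\!\!\!\perp_{cs} [X_J, X_W] \mid X_K$ [P];

Reverse contraction property

$X_I \perp\!\!\!\perp_{cs} X_W \mid [X_J, X_K]$ [P] and $X_J \perp\!\!\!\perp_{cs} X_W \mid X_K$ [P] $\Rightarrow$ $[X_I, X_J] \perp\!\!\!\perp_{cs} X_W \mid X_K$ [P];

Intersection property

$X_I \perp\!\!\!\perp_{cs} X_J \mid [X_W, X_K]$ [P] and $X_I \perp\!\!\!\perp_{cs} X_W \mid [X_J, X_K]$ [P] $\Rightarrow$
$X_I \perp\!\!\!\perp_{cs} [X_J, X_W] \mid X_K$ [P];

Reverse intersection property

$X_I \perp\!\!\!\perp_{cs} X_W \mid [X_J, X_K]$ [P] and $X_J \perp\!\!\!\perp_{cs} X_W \mid [X_I, X_K]$ [P] $\Rightarrow$
$[X_I, X_J] \perp\!\!\!\perp_{cs} X_W \mid X_K$ [P].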

Hence, these models satisfy all graphoid properties (see [1]) except the symmetry
property $X_I \perp\!\!\!\perp_{cs} X_J \mid X_K \Rightarrow X_J \perp\!\!\!\perp_{cs} X_I \mid X_K$ and the reverse weak union
property $[X_J, X_W] \perp\!\!\!\perp_{cs} X_I \mid X_K \Rightarrow X_J \perp\!\!\!\perp_{cs} X_I \mid [X_W, X_K]$.

However, the conditional independence models closed under the symmetric
property obey the graphoid properties [9].

In the sequel we deal only with this class of independence models. As noted in
Remark 3, this class is larger than the class of probability distributions inducing
a graphoid structure according to the classic independence notion.







4 Qualitative Bayesian Network with Logical Component 

In this section we focus our attention on the representation problem of the 
conditional independence models and logical constraints. 

An l-graph is a triplet $G = (V, E, \mathcal{B})$, where V is a finite set of vertices, E is a
set of edges defined as a subset of $V \times V$ (i.e. the set of all ordered pairs of distinct
vertices), and $\mathcal{B} = \{B : B \subseteq V\}$ is a family of subsets of vertices (called logical
components).

The vertices are represented by circles, and each $B \in \mathcal{B}$ by a box enclosing
those circles corresponding to vertices in B.

The definition of l-graph differs from that of graph (see [15], [1]), since our interest
in $\mathcal{B}$ is to cluster the sets of variables linked by some logical constraint. More
precisely, every vertex $v \in V$ or subset $I \subseteq V$ is associated to a variable $X_v$
or to a random vector $X_I$, respectively, and a box $B = \{v : v \in J\}$ visualizes
the minimal set of random variables $\{X_v : v \in J\}$ whose partitions generate the
given logical relation (see Section 3).

For example, in the l-graph in Figure 1 (which has no edges) the box
B = {1, 2, 3} is a logical component and represents a logical relation among the
variables $X_1, X_2, X_3$: it could be, e.g., that the event $(X_1 = 1, X_2 = 1, X_3 = 1)$
is not possible.









Fig. 1. Graphical representation of variables linked by a logical constraint 



More precisely, each box is used to emphasize where the logical constraint is
localized, and hence which variables are involved.

An edge $(u, v) \in E$ such that its “opposite” $(v, u)$ is not in E is called directed.
A directed edge $(u, v)$ is represented by $u \to v$; u is said to be a parent of
v and v a child of u. If an l-graph has only directed edges, it is called a directed
l-graph.

A path of length n from u to v is a sequence of distinct vertices $u =
u_0, \ldots, u_n = v$ ($n \geq 1$) such that either $(u_i, u_{i+1}) \in E$ and $(u_{i+1}, u_i) \notin E$
or $(u_{i+1}, u_i) \in E$ and $(u_i, u_{i+1}) \notin E$ for $i = 0, \ldots, n-1$. A path from u to v is
directed if $u_i \to u_{i+1}$ for all $i = 0, \ldots, n-1$.

If there is a directed path from u to v, we say that u leads to v, and we
denote it as $u \mapsto v$.

An n-cycle is a directed path of length n with $u_n \to u_0$. A directed l-graph
is acyclic if it contains no cycles.

The vertices u such that $u \mapsto v$ and there is no path from v to u are the
ancestors an(v) of v; the descendants ds(u) of u are the vertices v such that
$u \mapsto v$ and there is no path from v to u.
Note that, according to our definition, a sequence consisting of one vertex is
a directed path of length 0, and therefore every vertex is its own descendant and
ancestor, i.e. $u \in an(u)$, $u \in ds(u)$.



4.1 Separation Criteria for Directed Acyclic l-Graphs

Before introducing the new separation criterion, we recall the classical definition 
of blocked path (see, e.g. [1]). 

Definition 7. Let G be an acyclic directed graph. A path $u_1, \ldots, u_m$ in G is
blocked by a set of vertices $S \subseteq V$, whenever there exists a triplet of connected
vertices u, v, w on the path such that one of the following conditions holds:

1. either $u \to v$, $v \to w$ or $w \to v$, $v \to u$, and $v \in S$

2. $v \to u$, $v \to w$ and $v \in S$

3. $u \to v$, $w \to v$ and $ds(v) \cap S = \emptyset$
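
The three blocking conditions can be checked mechanically along a path. The following is a minimal sketch, assuming that the graph is given as a set of directed edges `(parent, child)` and that the path is a list of vertices; the function names `descendants` and `is_blocked` are illustrative and do not come from the paper.

```python
def descendants(v, children):
    """ds(v): the vertex v itself plus every vertex reachable along arrows."""
    seen, stack = {v}, [v]
    while stack:
        u = stack.pop()
        for c in children.get(u, ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def is_blocked(path, edges, s):
    """True if some consecutive triplet (u, v, w) on `path` blocks it given S."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    for u, v, w in zip(path, path[1:], path[2:]):
        head_to_head = (u, v) in edges and (w, v) in edges
        if head_to_head:
            # condition 3: neither v nor any of its descendants belongs to S
            if not (descendants(v, children) & s):
                return True
        elif v in s:
            # conditions 1 and 2: chain or fork whose middle vertex is in S
            return True
    return False

chain = {("u", "v"), ("v", "w")}
print(is_blocked(["u", "v", "w"], chain, {"v"}))
```

The sketch only inspects the given path; checking d-separation would additionally require enumerating every path between the two sets of vertices.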

The conditions can be illustrated by Figure 2 where the grey vertices are those 
belonging to S. 




Fig. 2. Blocked paths 



Vertex d-separation criterion [1] requires that every path going from one set 
to the other is blocked. The following generalization of this criterion is obtained 
using the notion of logical components. 

Definition 8. Let $G = (V, E, \mathcal{B})$ be an acyclic directed l-graph and let $V_1$, $V_2$
and S be three disjoint sets of vertices of V. The set of vertices $V_1$ is dl-separated
from $V_2$ through S in the directed acyclic l-graph G (in symbols $(V_1, V_2 \mid S)^l_G$) if
the following conditions hold:

1. every path in G from $V_1$ to $V_2$ is blocked by S;

2. there is no $B_i \in \mathcal{B}$ such that $B_i \subseteq V_1 \cup V_2 \cup S$, and both sets $B_i \cap V_1$ and
$B_i \cap V_2$ are not empty.

Obviously, when the set of logical components $\mathcal{B}$ is empty, the dl-separation criterion
coincides with d-separation. It is easy to check that the possible boxes $B_i$ in the
three situations of Figure 2 can be formed only by {u, v} and {v, w}. Actually,
putting a logical component, e.g. B = {u, v, w}, in the left side of Figure 2, the
vertices u and w are not dl-separated given v, even if u, v, w is a blocked path.
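
The additional requirement on boxes is easy to check on its own. The sketch below assumes that the outcome of the path-blocking test is already available as a Boolean, and that boxes and vertex sets are Python sets; the names `box_condition_holds` and `dl_separated` are hypothetical.

```python
def box_condition_holds(boxes, v1, v2, s):
    """No box contained in V1 | V2 | S may intersect both V1 and V2."""
    union = v1 | v2 | s
    return not any(b <= union and b & v1 and b & v2 for b in boxes)

def dl_separated(all_paths_blocked, boxes, v1, v2, s):
    """dl-separation: every path is blocked and the box condition holds."""
    return all_paths_blocked and box_condition_holds(boxes, v1, v2, s)

# The path u, v, w is blocked by {v}, but the box {u, v, w} links u and w.
print(dl_separated(True, [{"u", "v", "w"}], {"u"}, {"w"}, {"v"}))
```

With the box {u, v} instead, the same query succeeds, because that box does not meet the set containing w.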
The statement “the set of vertices $V_1$ is dl-separated from $V_2$ given S” in an
l-graph $G = (V, E, \mathcal{B})$ is denoted as $(V_1, V_2 \mid S)^l_G$. The difference between the
dl-separation criterion and the classical one [1] is established by condition 2
of Definition 8. Therefore, to detect the properties of the dl-separation criterion, we
must check the graphoid properties verified by the relation $X \perp_{\mathcal{B}} Y \mid Z$:

$\forall B \in \mathcal{B}:\ B \subseteq X \cup Y \cup Z \Rightarrow B \subseteq X \cup Z$ or $B \subseteq Y \cup Z$.

The following result has been proved in [7]: 

Theorem 5. The relation $X \perp_{\mathcal{B}} Y \mid Z$ is a graphoid.

It is well known that d-separation criterion for Bayesian Networks satisfies 
graphoid properties [1], so by the previous result we get the following result: 



Corollary 1. The vertex dl-separation criterion verifies graphoid properties. 

4.2 Representation Problem through Directed Acyclic l-Graphs

Given an independence model $\mathcal{M}$, we look for a directed acyclic l-graph G able
to visualize the logical constraints by means of the boxes, and to describe all
the statements T in $\mathcal{M}$. If such a G describing all $T \in \mathcal{M}$ exists, then we say that
G is a perfect map for the model $\mathcal{M}$. Generally, it is not always feasible to have
a perfect map G for $\mathcal{M}$ (see Example 1).

Therefore, we need to introduce, analogously as done in [1], the notion of 
I-map and the related algorithm to build it. 

Definition 9. A directed acyclic l-graph G is an I-map for a given independence
model $\mathcal{M}$ iff every independence statement represented by means of the dl-separation
criterion in G is also in $\mathcal{M}$.

Thus an I-map G for $\mathcal{M}$ may not represent every statement of $\mathcal{M}$, but the ones it
represents are actually in $\mathcal{M}$; this means that the set $\mathcal{M}_G$ of statements described
by G is contained in $\mathcal{M}$.

An I-map G for $\mathcal{M}$ is said to be minimal if, removing any arrow from the l-graph
G, the obtained l-graph is no longer an I-map for $\mathcal{M}$.

Given an independence model $\mathcal{M}$ over the random vector $(X_1, \ldots, X_n)$ possibly
linked by some logical constraints, let $\mathcal{B}$ be the set of logical components and,
for any ordering $\pi = (\pi_1, \ldots, \pi_n)$ on the variables, consider the graph G obtained
according to the following procedure (which is an enhanced version of that given
by Pearl): draw the vertices according to the ordering $\pi$ and the boxes $B \in \mathcal{B}$;
for each vertex $\pi_j$, let $U_{\pi_j} = \{\pi_1, \ldots, \pi_{j-1}\}$ be the set of vertices before $\pi_j$, and
draw an arrow pointing to $\pi_j$ from each vertex in $D_{\pi_j} \subseteq U_{\pi_j}$, which satisfies
the following rules: if $\pi_j \in B$, then $B \cap R_{\pi_j}$ is empty, with $R_{\pi_j} = U_{\pi_j} \setminus D_{\pi_j}$;
moreover $X_{\pi_j} \perp\!\!\!\perp_{cs} X_{R_{\pi_j}} \mid X_{D_{\pi_j}} \in \mathcal{M}$; remove from the boxes B the superfluous
arrows (if there are any).

In other words, $D_{\pi_j}$ is the set of parents of $\pi_j$.
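
The construction can be sketched in code under some simplifying assumptions, none of which come from the paper: the independence model is accessed through a hypothetical oracle `indep(v, r, d)` answering whether "v is independent of r given d" belongs to the model, the parent set of each vertex is chosen as a smallest admissible one by brute-force search, and the final step that removes superfluous arrows inside boxes is not modelled.

```python
from itertools import combinations

def build_parents(ordering, boxes, indep):
    """Choose, for each vertex, a smallest admissible parent set among its predecessors.

    ordering: list of vertices; boxes: list of sets (logical components);
    indep(v, r, d): hypothetical oracle for membership of the statement
    'v independent of r given d' in the independence model.
    """
    parents = {}
    for j, v in enumerate(ordering):
        preds = ordering[:j]
        # predecessors sharing a box with v must be parents (the box misses R)
        forced = {u for b in boxes if v in b for u in b if u in preds}
        optional = [u for u in preds if u not in forced]
        chosen = set(preds)  # taking all predecessors always works (R is empty)
        for k in range(len(optional) + 1):
            found = None
            for extra in combinations(optional, k):
                d = forced | set(extra)
                r = set(preds) - d
                if not r or indep(v, frozenset(r), frozenset(d)):
                    found = d
                    break
            if found is not None:
                chosen = found
                break
        parents[v] = chosen
    return parents

# Toy model with two statements and one box {1, 3} (illustrative only).
model = {(2, frozenset({1}), frozenset()),
         (4, frozenset({1, 2}), frozenset({3}))}
result = build_parents([1, 2, 3, 4], [{1, 3}], lambda v, r, d: (v, r, d) in model)
print(result)
```

In this toy run vertex 3 receives both 1 and 2 as parents: vertex 1 is forced by the box, and the model contains no statement that would allow 2 to be left out.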

Given the ordering $\pi$, a suitable basic list of independence statements $\Theta_\pi$
arises from $\mathcal{M}$, and it allows us to build a directed acyclic l-graph. Actually,
such a graph is an I-map for $\mathcal{M}$; the proof is analogous to that given in [10]. In
fact, the main difference consists in the set of logical components, which does
not depend on the chosen ordering.

Theorem 6. Let $\mathcal{M}$ be an independence model induced by a conditional
probability, and $\pi$ an ordering on the random variables. Then the directed acyclic
l-graph G generated by $\Theta_\pi$ is an I-map for $\mathcal{M}$.

Now, we give an example to show how to build the I-maps for a given conditional 
independence model. 

Example 1. Let $X_1, X_2, X_3$ be three random variables: the codomain of $X_1$ is
{0, 1, 2}, while the other two variables take values in {0, 1} and they are linked
by the logical constraint $(X_1 = 0) \wedge (X_3 = 1) = \emptyset$. Consider the following
assessment P (the sub-vector $(X_i, X_j)$ is denoted as $X_{ij}$)
and suppose that $P(X_2 = 1 \mid X_1 = 1, X_3 = 1) = \beta = P(X_2 = 1 \mid X_1 = 2, X_3 = 1)$,
while $P(\cdot \mid X_3 = 1, X_4 = 1)$ over $X_{12}$ is defined as follows

            $X_2 = 1$                                                $X_2 = 0$
$X_1 = 0$   $0$                                                      $0$
$X_1 = 1$   $\dfrac{(b_1\beta - c)\beta}{(b_1 + b_2)\beta - c - d}$    $\dfrac{(b_1\beta - c)(1 - \beta)}{(b_1 + b_2)\beta - c - d}$
$X_1 = 2$   $\dfrac{(b_2\beta - d)\beta}{(b_1 + b_2)\beta - c - d}$    $\dfrac{(b_2\beta - d)(1 - \beta)}{(b_1 + b_2)\beta - c - d}$

where $a, \beta, \gamma, b_1, b_2, c, d$ are parameters in (0, 1) with $a + b_1 + b_2 = 1$ and $a + c + d = 1$.

Obviously, if $b_1 \neq c/\beta$ and $b_2 \neq d/\beta$, then the values of the conditional
probabilities $P(X_2 = 1 \mid X_1 = 1, X_3 = 1)$ and $P(X_2 = 1 \mid X_1 = 2, X_3 = 1)$ can be
easily computed from the joint probability. Note that, if $b_1 = c/\beta$ (or $b_2 = d/\beta$),
then $P(X_1 = 1 \mid X_2 = 1, X_3 = 1) = 0 = P(X_1 = 1 \mid X_2 = 0, X_3 = 1)$ (or
$P(X_1 = 2 \mid X_2 = 1, X_3 = 1) = 0 = P(X_1 = 2 \mid X_2 = 0, X_3 = 1)$).

However, for any value of the parameters (also $b_1 = c/\beta$ and/or $b_2 = d/\beta$) it
is easy to verify using Theorem 4 that the conditional independence model
$\mathcal{M}$ induced by P is composed of the statements $X_1 \perp\!\!\!\perp_{cs} X_2 \mid X_3$, $X_1 \perp\!\!\!\perp_{cs} X_2$,
$(X_1, X_2) \perp\!\!\!\perp_{cs} X_4 \mid X_3$ (so $X_1 \perp\!\!\!\perp_{cs} X_4 \mid (X_2, X_3)$, $X_2 \perp\!\!\!\perp_{cs} X_4 \mid (X_1, X_3)$) and their
symmetric ones.

Therefore there is not a unique minimal I-map; in Fig. 3 two possible ones
are shown: the one on the left is related to, e.g., the ordering $\pi_1 = (1, 2, 3, 4)$,
while the one on the right is related to, e.g., $\pi_2 = (4, 3, 1, 2)$.

Fig. 3. Minimal I-maps for $\mathcal{M}$



Actually, the picture on the left represents the independence
statements $X_1 \perp\!\!\!\perp_{cs} X_2$, $X_4 \perp\!\!\!\perp_{cs} (X_1, X_2) \mid X_3$ and their symmetric ones, while the
one on the right describes the statements $X_1 \perp\!\!\!\perp_{cs} X_2 \mid X_3$, $(X_1, X_2) \perp\!\!\!\perp_{cs} X_4 \mid X_3$
and their symmetric ones. Note that these two graphs are minimal I-maps; in
fact, removing any arrow from them, we may read independence statements not
in $\mathcal{M}_P$. The box B = {1, 3} localizes the given logical constraint
$(X_1 = 0) \wedge (X_3 = 1) = \emptyset$.

Actually, a perfect map G for a given independence model over n variables (if
it exists) can be found by examining the (possible) n! I-maps. More precisely, the
orderings which give rise to G are all the orderings compatible with the partial
order induced by G.

5 Conclusions 

An enhanced version of Qualitative Bayesian Networks has been presented and 
its main properties have been shown. This is particularly useful for effective 
description of independence models and logical constraints among the given 
variables. We have generalized the procedure (given through d-separation) for
finding, for any independence model $\mathcal{M}$, given an ordering on the variables, a
minimal I-map for $\mathcal{M}$.

Along the same lines, we can generalize the well-known classic separation 
criterion for undirected graphs [15]; this can be done by introducing the notion
of logical components in undirected graphs and by redefining the separation 
criterion. 

In this paper we have shown that this theory is particularly fit for conditional 
probabilistic independence models, but we want to stress that it can be used also 
for independence models induced by other uncertainty measures. 
Abstract. A qualitative probabilistic network models the probabilistic
relationships between its variables by means of signs. Non-monotonic in- 
fluences are modelled by the ambiguous sign ’?’, which indicates that the 
actual sign of the influence depends on the current state of the network. 
The presence of influences with such ambiguous signs tends to lead to 
ambiguous results upon inference. In this paper we introduce the concept 
of situational influence into qualitative networks. A situational influence 
is a non-monotonic influence supplemented with a sign that indicates its 
effect in the current state of the network. We show that reasoning with 
such situational influences may forestall ambiguous results upon infer- 
ence; we further show how these influences change as the current state 
of the network changes. 



1 Introduction 

The formalism of Bayesian networks [1] is generally considered an intuitively 
appealing and powerful formalism for capturing the knowledge of a complex 
problem domain along with its uncertainties. The usually large number of prob- 
abilities required for a Bayesian network, however, tends to pose a major obstacle 
to the construction [2]. Qualitative probabilistic networks (QPNs), introduced as 
qualitative abstractions of Bayesian networks [3] , do not suffer from this quantifi- 
cation obstacle. Like a Bayesian network, a qualitative network encodes variables 
and the probabilistic relationships between them in a directed graph; the rela- 
tionships between the variables are not quantified by conditional probabilities as 
in a Bayesian network, however, but are summarised by qualitative signs instead. 
For inference with a qualitative probabilistic network an efficient algorithm is 
available, based on the idea of propagating and combining signs [4]. 

Although qualitative probabilistic networks do not suffer from the obstacle 
of requiring a large number of probabilities, their high level of abstraction causes 
some lack of representation detail. As a consequence, for example, qualitative 
networks do not provide for modelling non-monotonic influences in an informa- 
tive way. An influence of a variable A on a variable B is called non-monotonic 
if it is positive in one state and negative in another state of the network. Such a 
non-monotonic influence is modelled by the ambiguous sign ’?’. The presence of
influences with such ambiguous signs typically leads to ambiguous, and thereby 
uninformative, results upon inference. 
Non-monotonicity of an influence in essence indicates that the influence cannot
be captured by an unambiguous sign of general validity. In each particular
state of the network, however, the influence is unambiguous. In this paper we 
extend the framework of qualitative probabilistic networks with situational in- 
fluences that capture information about the current effect of non-monotonic 
influences. We show that these situational influences can be used upon inference 
and may effectively forestall ambiguous results. Because the sign of a situational 
influence depends on the current state of the network, we investigate how it 
changes as the state of the network changes. We then adapt the standard prop- 
agation algorithm to inference with networks including situational influences. 

The remainder of this paper is organised as follows. Section 2 provides some 
preliminaries on qualitative probabilistic networks. Section 3 introduces the con- 
cept of situational influence. Its dynamics are described in Sect. 4, which also 
gives an adapted propagation algorithm. The paper ends with some conclusions 
and directions for further research in Sect. 5. 

2 Preliminaries 

Qualitative probabilistic networks were introduced as qualitative abstractions of 
Bayesian networks. Before reviewing qualitative networks, therefore, we briefly 
address their quantitative counterparts. 

A Bayesian network is a concise representation of a joint probability distri- 
bution Pr on a set of statistical variables. In the sequel, (sets of) variables are 
denoted by upper-case letters. For ease of exposition, we assume all variables to 
be binary, writing $a$ for A = true and $\bar{a}$ for A = false. We further assume that
true > false. Each variable is now represented by a node in an acyclic directed 
graph. The probabilistic relationships between the variables are captured by the 
digraph’s set of arcs. Associated with each variable A is a set of (conditional) 
probability distributions $\Pr(A \mid \pi(A))$ describing the influence of the parents
$\pi(A)$ of A on the probability distribution for A itself.

Example 1. We consider the small Bayesian network shown in Fig. 1. The 
network represents a fragment of fictitious knowledge about the effect of training 
and fitness on one’s feeling of well-being. Node T models whether or not one has 
undergone a training session, node F captures one’s fitness, and node W models 
whether or not one has a feeling of well-being. □ 



[Figure: nodes T and F, each with an arc to node W. $\Pr(t) = 0.1$; $\Pr(f) = 0.4$;
$\Pr(w \mid tf) = 0.90$, $\Pr(w \mid t\bar{f}) = 0.05$, $\Pr(w \mid \bar{t}f) = 0.75$, $\Pr(w \mid \bar{t}\bar{f}) = 0.35$]

Fig. 1. An example Bayesian network, modelling the influences of training (T) and
fitness (F) on a feeling of well-being (W)
In its initial state where no observations for variables have been entered, a
Bayesian network captures a prior probability distribution. As such evidence 
becomes available, the network converts to another state and then serves to 
represent the posterior distribution given the evidence. 

Qualitative probabilistic networks bear a strong resemblance to Bayesian net- 
works. A qualitative network also comprises an acyclic digraph modelling vari- 
ables and the probabilistic relationships between them. Instead of conditional 
probability distributions, however, a qualitative probabilistic network associates 
with its digraph qualitative influences and qualitative synergies, capturing fea- 
tures of the existing, albeit unknown, joint distribution Pr [3]. 

A qualitative influence between two nodes expresses how the values of one
node influence the probabilities of the values of the other node. For example,
a positive qualitative influence of a node A on a node B along an arc $A \to B$,
denoted $S^+(A, B)$, expresses that observing a high value for A makes the higher
value for B more likely, regardless of any other direct influences on B, that is

$\Pr(b \mid ax) - \Pr(b \mid \bar{a}x) \geq 0$,

for any combination of values x for the set $\pi(B) \setminus \{A\}$ of parents of B other than
A. A negative qualitative influence, denoted $S^-$, and a zero qualitative influence,
denoted $S^0$, are defined analogously. A non-monotonic or unknown influence of
node A on node B is denoted by $S^?(A, B)$.

The set of all influences of a qualitative network exhibits various important
properties [3]. The property of symmetry states that, if the network includes
the influence $S^\delta(A, B)$, then it also includes $S^\delta(B, A)$, $\delta \in \{+, -, 0, ?\}$. The
transitivity property asserts that the signs of qualitative influences along a trail
without head-to-head nodes combine into a sign for the net influence with the
$\otimes$-operator from Table 1. The property of composition asserts that the signs of
multiple influences between two nodes along parallel trails combine into a sign
for the net influence with the $\oplus$-operator.

Table 1. The $\otimes$- and $\oplus$-operators for combining signs
A qualitative probabilistic network further includes additive synergies. An
additive synergy expresses how two nodes interact in their influence on a third
node. For example, a positive additive synergy of a node A and a node B on a
common child C, denoted $Y^+(\{A, B\}, C)$, expresses that the joint influence of A
and B on C exceeds the sum of their separate influences regardless of any other
direct influences on C, that is

$\Pr(c \mid abx) + \Pr(c \mid \bar{a}\bar{b}x) \geq \Pr(c \mid a\bar{b}x) + \Pr(c \mid \bar{a}bx)$,

for any combination of values x for the set $\pi(C) \setminus \{A, B\}$ of parents of C other
than A and B. A negative additive synergy, denoted $Y^-$, and a zero additive synergy,
denoted $Y^0$, are defined analogously. A non-monotonic or unknown additive
synergy of nodes A and B on a common child C is denoted by $Y^?(\{A, B\}, C)$.

Example 2. We consider the qualitative abstraction of the Bayesian network
from Fig. 1. From the conditional probability distributions specified for node W,
we have that $\Pr(w \mid tf) - \Pr(w \mid t\bar{f}) > 0$ and $\Pr(w \mid \bar{t}f) - \Pr(w \mid \bar{t}\bar{f}) > 0$, and
therefore that $S^+(F, W)$: fitness favours well-being regardless of training. We
further have that $\Pr(w \mid tf) - \Pr(w \mid \bar{t}f) > 0$ and $\Pr(w \mid t\bar{f}) - \Pr(w \mid \bar{t}\bar{f}) < 0$,
and therefore that $S^?(T, W)$: the effect of training on well-being depends on one’s
fitness. From $\Pr(w \mid tf) + \Pr(w \mid \bar{t}\bar{f}) > \Pr(w \mid t\bar{f}) + \Pr(w \mid \bar{t}f)$, to conclude, we
find that $Y^+(\{T, F\}, W)$. The resulting qualitative network is shown in Fig. 2;
the signs of the qualitative influences are shown along the arcs, and the sign of
the additive synergy is indicated over the curve over variable W. □
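
The signs in this example can be recomputed from the conditional probabilities of node W. The sketch below assumes that the four values in Fig. 1 are to be read in the order Pr(w | t f), Pr(w | t not-f), Pr(w | not-t f), Pr(w | not-t not-f); the dictionary encoding and the function names are illustrative only.

```python
# Pr(w | T, F), keyed by the truth values of (T, F); assumed reading of Fig. 1.
pr_w = {(True, True): 0.90, (True, False): 0.05,
        (False, True): 0.75, (False, False): 0.35}

def sign_of(diffs):
    """Summarise a list of differences by a qualitative sign."""
    if all(d == 0 for d in diffs):
        return "0"
    if all(d >= 0 for d in diffs):
        return "+"
    if all(d <= 0 for d in diffs):
        return "-"
    return "?"

def influence_of_f(cpt):
    """Sign of the influence of F on W, over both values of T."""
    return sign_of([cpt[(t, True)] - cpt[(t, False)] for t in (True, False)])

def influence_of_t(cpt):
    """Sign of the influence of T on W, over both values of F."""
    return sign_of([cpt[(True, f)] - cpt[(False, f)] for f in (True, False)])

def additive_synergy(cpt):
    """Sign of the additive synergy of T and F on W."""
    return sign_of([cpt[(True, True)] + cpt[(False, False)]
                    - cpt[(True, False)] - cpt[(False, True)]])

print(influence_of_f(pr_w), influence_of_t(pr_w), additive_synergy(pr_w))
```

The influence of T comes out as '?' because its two differences have opposite signs, which is exactly the non-monotonicity discussed in the text.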




Fig. 2. The qualitative abstraction of the Bayesian network from Fig. 1 

We would like to note that, although in the previous example the qualitative 
relationships between the variables are computed from the conditional proba- 
bilities of the corresponding quantitative network, in realistic applications these 
relationships are elicited directly from domain experts. 

For inference with a qualitative probabilistic network, an efficient algorithm
based on the idea of propagating and combining signs is available [4]. This
algorithm traces the effect of observing a value for a node upon the other nodes
in a network by message passing between neighbouring nodes. The algorithm
is summarised in pseudo-code in Fig. 3. For each node V, a node sign ’sign[V]’
is determined, indicating the direction of change in its probability distribution
occasioned by the new observation; initial node signs equal ’0’. Observations are
entered as a ’+’ for the observed value true, or a ’−’ for the value false. Each
node receiving a message updates its sign using the $\oplus$-operator and subsequently
sends a message to each neighbour that is not independent of the observed node.
The sign of this message is the $\otimes$-product of the node’s (new) sign and the sign
of the influence it traverses. This process of message passing between neighbours
is repeated throughout the network, building on the properties of symmetry,
transitivity, and composition of influences. Since each node can change its sign
at most twice (once from ’0’ to ’+’, ’−’ or ’?’, and then only to ’?’), the process
visits each node at most twice and therefore halts in polynomial time.

procedure Process-Observation(Q, O, sign):
    for all V_i ∈ V(G) in Q
    do sign[V_i] ← '0';
    Propagate-Sign(Q, ∅, O, sign).

procedure Propagate-Sign(Q, trail, to, message):
    sign[to] ← sign[to] ⊕ message;
    trail ← trail ∪ {to};
    for each neighbour V_i of to in Q
    do linksign ← sign of influence between to and V_i;
       message ← sign[to] ⊗ linksign;
       if V_i ∉ trail and sign[V_i] ≠ sign[V_i] ⊕ message
       then Propagate-Sign(Q, trail, V_i, message).

Fig. 3. The sign-propagation algorithm

Example 3. We consider the qualitative network shown in Fig. 4. Suppose that
we are interested in the effect of observing the value false for node A upon the
other nodes in the network. Prior to the inference, the node signs for all nodes
are set to '0'. Inference now starts with node A receiving the message '−'. Node
A updates its node sign to 0 ⊕ − = −, and subsequently computes the messages
to be sent to its neighbours E, B and D. To node E, node A sends the message
− ⊗ − = +. Upon receiving this message, node E updates its node sign to
0 ⊕ + = +. Node E does not propagate the message it has received from A to
node B because A and B are independent on the trail A → E ← B. To node
B, node A sends the message − ⊗ ? = ?. Upon receiving this message, node B
updates its node sign to 0 ⊕ ? = ?. Node B subsequently computes the message
? ⊗ + = ? for E. Upon receiving this message, node E updates its node sign to
+ ⊕ ? = ?. Node B does not propagate the message it has received from A to
node C because A and C are independent on the trail A → B ← C. Exploiting
the property of symmetry, node A sends the message − ⊗ + = − to node D.
Upon receiving this message, node D updates its node sign to 0 ⊕ − = −.
Node D subsequently computes the message − ⊗ + = − for C. Upon receiving
this message, node C updates its node sign to 0 ⊕ − = −. Node C then sends
the message − ⊗ − = + to B, upon which node B should update its node
sign to ? ⊕ + = ?. Since this update would not change the node sign of B, the
propagation of messages halts. The node signs resulting from the inference are
shown in the network's nodes in Fig. 4. □
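
The message passing of Example 3 can be retraced with a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the arc directions and signs in `FIG4_ARCS` are read off from the messages in the example (the figure itself is not reproduced here), the string '-' stands for the negative sign, and the blocking rule assumes a single observation, so that a message entering a node from one parent is never passed on to another parent.

```python
def sign_sum(a, b):
    """Sign addition: combines the signs of parallel influences."""
    if a == "0":
        return b
    if b == "0":
        return a
    return a if a == b else "?"


def sign_product(a, b):
    """Sign multiplication: chains the signs of consecutive influences."""
    if a == "0" or b == "0":
        return "0"
    if a == "?" or b == "?":
        return "?"
    return "+" if a == b else "-"


def propagate(arcs, observed, observation):
    """Return the node signs after one observation.

    arcs maps (parent, child) to the sign of the influence;
    observation is '+' for the value true and '-' for the value false.
    """
    sign = {v: "0" for arc in arcs for v in arc}

    def visit(trail, sender, node, message):
        sign[node] = sign_sum(sign[node], message)
        trail = trail | {node}
        entered_from_parent = (sender, node) in arcs
        for (parent, child), linksign in arcs.items():
            if node == parent:
                neighbour = child
            elif node == child:
                # Assumption: with a single observation, a head-to-head
                # node blocks the trail from one parent to another.
                if entered_from_parent:
                    continue
                neighbour = parent
            else:
                continue
            out = sign_product(sign[node], linksign)
            if neighbour not in trail and sign[neighbour] != sign_sum(sign[neighbour], out):
                visit(trail, node, neighbour, out)

    visit(frozenset(), None, observed, observation)
    return sign


# Arcs inferred from the messages described in Example 3 (assumption).
FIG4_ARCS = {
    ("A", "E"): "-", ("A", "B"): "?", ("B", "E"): "+",
    ("D", "A"): "+", ("D", "C"): "+", ("C", "B"): "-",
}

if __name__ == "__main__":
    for node, s in sorted(propagate(FIG4_ARCS, "A", "-").items()):
        print(node, s)
```

Under these assumptions the sketch ends with the ambiguous sign '?' for nodes B and E, as in the example.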



Fig. 4. A qualitative network and its node signs after the observation A = false

3 Situational Influences

The presence of influences with ambiguous signs in a qualitative network is likely
to give rise to uninformative ambiguous results upon inference, as illustrated in
Example 3. We take a closer look at the origin of these ambiguous signs. We
observe that a qualitative influence of a node A on a node B along an arc A → B
is only unambiguous if the difference $\Pr(b \mid ax) - \Pr(b \mid \bar{a}x)$ has the same sign for
all combinations of values x for the set $X = \pi(B) \setminus \{A\}$. As soon as the difference
$\Pr(b \mid ax) - \Pr(b \mid \bar{a}x)$ yields contradictory signs for different combinations x,
the influence is non-monotonic and is assigned the ambiguous sign '?'. In each
specific state of the network, associated with a specific probability distribution
Pr(X) over all combinations x, however, the influence of A on B is unambiguous,
that is, either positive, negative or zero. To capture the current sign of a
non-monotonic influence in a specific state, we introduce the concept of situational
influence into the formalism of qualitative probabilistic networks.

We consider a qualitative network as before and consider the evidence e
entered so far. A positive situational influence of a node A on a node B given e,
denoted $S_e^{?(+)}(A, B)$, is a non-monotonic influence of A on B for which

$$\Pr(b \mid ae) - \Pr(b \mid \bar{a}e) > 0 .$$

In the sequel we omit the subscript e from $S_e^{?(+)}$ as long as ambiguity cannot
occur. A negative situational influence, denoted $S^{?(-)}(A, B)$, and a zero situational
influence, denoted $S^{?(0)}(A, B)$, are defined analogously. An unknown situational
influence of node A on node B is denoted by $S^{?(?)}(A, B)$. The sign between the
brackets will be called the sign of the situational influence. A qualitative network
extended with situational influences will be called a situational qualitative
network. Note that while the signs of qualitative influences and additive synergies
have general validity, the signs of situational influences pertain to a specific state
of the network and depend on Pr(X).

Example 4. We consider once again the network fragment from Fig. 1 and its
qualitative abstraction shown in Fig. 2. The qualitative influence of node T on
node W was found to be non-monotonic. Its sign therefore depends on the state
of the network. In the prior state of the network where no evidence has been
entered, we have that $\Pr(f) = 0.4$. Given this probability, we find $\Pr(w \mid t) = 0.39$
and $\Pr(w \mid \bar{t}) = 0.51$. From the difference $\Pr(w \mid t) - \Pr(w \mid \bar{t}) = -0.12$
being negative, we conclude that the influence of node T on node W is negative
in this particular state. The current sign of the influence is therefore '−'. The
situational qualitative network for the prior state is shown in Fig. 5. The dynamic
nature of the sign of the situational influence is illustrated by a change from '−'
to '+' after, for example, the observation F = true is entered into the network,
in which case $\Pr(w \mid tf) - \Pr(w \mid \bar{t}f) = 0.90 - 0.75 = 0.15$. □
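
The numbers in Example 4 can be checked with a few lines of Python. The sketch below is only illustrative: the example quotes Pr(w | tf) and Pr(w | t̄f) but not the two entries for F = false, so these are back-solved here from the quoted prior values rather than taken from Fig. 1, and all variable and function names are hypothetical.

```python
PR_F_PRIOR = 0.4                      # Pr(f) in the prior state
PR_W_T, PR_W_NOT_T = 0.39, 0.51       # Pr(w | t) and Pr(w | not-t) in the prior state
PR_W_TF, PR_W_NOT_T_F = 0.90, 0.75    # Pr(w | t f) and Pr(w | not-t f)

# Assumption: the entries for F = false are derived from the values above,
# using Pr(w | t) = Pr(f) Pr(w | t f) + (1 - Pr(f)) Pr(w | t not-f).
PR_W_T_NOT_F = (PR_W_T - PR_F_PRIOR * PR_W_TF) / (1 - PR_F_PRIOR)
PR_W_NOT_T_NOT_F = (PR_W_NOT_T - PR_F_PRIOR * PR_W_NOT_T_F) / (1 - PR_F_PRIOR)


def influence_of_t_on_w(pr_f):
    """Pr(w | t) - Pr(w | not-t) for a given probability of f."""
    pr_w_t = pr_f * PR_W_TF + (1 - pr_f) * PR_W_T_NOT_F
    pr_w_not_t = pr_f * PR_W_NOT_T_F + (1 - pr_f) * PR_W_NOT_T_NOT_F
    return pr_w_t - pr_w_not_t


def situational_sign(pr_f, eps=1e-12):
    """Sign of the situational influence of T on W in the given state."""
    d = influence_of_t_on_w(pr_f)
    return "+" if d > eps else "-" if d < -eps else "0"


if __name__ == "__main__":
    for pr_f in (0.4, 1.0):
        print(pr_f, round(influence_of_t_on_w(pr_f), 2), situational_sign(pr_f))
```

With the back-solved entries the difference is linear in Pr(f), which is why the sign can flip between the prior state and the state after observing F = true.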

Fig. 5. The network from Fig. 2, now with the prior situational influence of T on W

Once again we note that, although in the previous example the sign of the
situational influence is computed from the quantitative network, in a realistic
application it would be elicited directly from a domain expert. In the remainder
of the paper, we assume that the expert has given the signs of the situational
influences for the prior state of the network.

4 Inference with a Situational Qualitative Network

For inference with a regular qualitative probabilistic network, an efficient algo- 
rithm is available that is based on the idea of propagating and combining signs 
of qualitative influences, as reviewed in Sect. 2. For inference with a situational 
qualitative network, we observe that the sign of a situational influence indicates 
the sign of the original qualitative influence in the current state of the network. 
After an observation has been entered into a situational network, therefore, the 
signs of the situational influences can in essence be propagated as in regular 
qualitative networks, provided that these signs are still valid in the new state 
of the network. In this section we discuss how to verify the validity of the sign 
of a situational influence as observations become available that cause the net- 
work to convert to another state. In addition, we show how to incorporate this 
verification into the sign propagation algorithm. 

4.1 Dynamics of the Signs of Situational Influences 

We begin by investigating the simplest network fragment in which a non- 
monotonic qualitative influence can occur, consisting of a single node with two 
independent parents. We show for this fragment how the validity of the sign of 
the situational influence can be verified during inference by exploiting the associ- 
ated additive synergy. We then extend the main idea to more general situational 
networks. 




Fig. 6. A fragment of a situational network, consisting of node B and its parents A
and C, with $S^{?(\delta_1)}(A, B)$ and $Y^{\delta_2}(\{A, C\}, B)$

We consider the network fragment from Fig. 6, consisting of node B and its 
mutually independent parents A and C. We assume for now that the nodes A 
and C remain independent as observations are being entered into the network. 
By conditioning on A and C, we find for the probability of b: 

$$\Pr(b) = \Pr(a) \cdot \big[\Pr(c) \cdot \big(\Pr(b \mid ac) - \Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}c) + \Pr(b \mid \bar{a}\bar{c})\big) + \Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}\bar{c})\big] + \Pr(c) \cdot \big(\Pr(b \mid \bar{a}c) - \Pr(b \mid \bar{a}\bar{c})\big) + \Pr(b \mid \bar{a}\bar{c}) .$$

We observe that $\Pr(b)$ is a function of $\Pr(a)$ and $\Pr(c)$, and that for a fixed $\Pr(c)$,
$\Pr(b)$ is a linear function of $\Pr(a)$. For $\Pr(a) = 1$, the function yields $\Pr(b \mid a)$;
for $\Pr(a) = 0$, it yields $\Pr(b \mid \bar{a})$. Moreover, the sign of the gradient of the function
at a particular $\Pr(c)$ matches the sign of the situational influence of node A on node
B for that $\Pr(c)$. In essence, we have two different, so-called, manifestations
of the non-monotonic influence of A on B: either the sign of the situational
influence is negative for low values of $\Pr(c)$ and positive for high values of $\Pr(c)$,
as shown in Fig. 7, or vice versa, as shown in Fig. 8.




Fig. 7. An example $\Pr(b)$ as a function of $\Pr(a)$ and $\Pr(c)$, with $S^{?}(A, B)$, $S(C, B)$
and $Y^{+}(\{A, C\}, B)$




Fig. 8. An example $\Pr(b)$ as a function of $\Pr(a)$ and $\Pr(c)$, with $S^{?}(A, B)$, $S(C, B)$
and $Y^{-}(\{A, C\}, B)$

As a result of observations being entered into the network, the probability
of c may change. The sign of the situational influence of node A on node B 
may then change as well. For some changes of the probability of c, however, 
the sign will definitely stay the same. Whether or not it will do so depends 
on the manifestation of the non-monotonic influence, on the current sign, and 
on the direction of change of the probability of c. In the graph depicted in 
Fig. 7, for example, the sign of the situational influence will definitely persist 
if it is negative and the probability of c decreases, or if it is positive and the 
probability of c increases. The reverse holds for the graph depicted in Fig. 8. A 
method for verifying whether or not the sign of a situational influence retains its 
validity thus has to distinguish between the two possible manifestations of the 
underlying non-monotonic influence. 

The sign of the additive synergy involved can now aid in distinguishing between
the possible manifestations of a non-monotonic influence under study.
We recall that a positive additive synergy of nodes A and C on their common
child B indicates that $\Pr(b \mid ac) - \Pr(b \mid \bar{a}c) > \Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}\bar{c})$.
From the influence of A on B being non-monotonic, we have that the differences
$\Pr(b \mid ac) - \Pr(b \mid \bar{a}c)$ and $\Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}\bar{c})$ have opposite signs.
A positive additive synergy of A and C on B now implies that the sign of
$\Pr(b \mid ac) - \Pr(b \mid \bar{a}c)$ must be positive and that the sign of $\Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}\bar{c})$
must be negative, as in Fig. 7. Analogously, a negative additive synergy
corresponds to the manifestation of the non-monotonic influence shown in Fig. 8.

From the previous observations, we have that the sign of the additive synergy
involved can be exploited for verifying whether or not the sign of a situational
influence retains its validity during inference. Suppose that, as in Fig. 6, we have
$S^{?(\delta_1)}(A, B)$ and $Y^{\delta_2}(\{A, C\}, B)$. Further suppose that new evidence causes a
change in the probability of c, the direction of which is reflected in sign[C]. Then,
we can be certain that $\delta_1$ will remain valid if

$$\delta_1 = \mathrm{sign}[C] \otimes \delta_2 .$$

Otherwise, $\delta_1$ has to be changed into '?'. We can substantiate our statement as
follows. Abstracting from previously entered evidence, we have that

$$\Pr(b \mid a) - \Pr(b \mid \bar{a}) = \Pr(c) \cdot \big(\Pr(b \mid ac) - \Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}c) + \Pr(b \mid \bar{a}\bar{c})\big) + \Pr(b \mid a\bar{c}) - \Pr(b \mid \bar{a}\bar{c}) .$$

We observe that the equation expresses the difference $\Pr(b \mid a) - \Pr(b \mid \bar{a})$ as
a linear function of $\Pr(c)$. We further observe that the sign of the gradient of
this function equals the sign of the additive synergy of A and C on B. Now
suppose that the probability of c increases as a result of the new evidence, and
that $Y^{+}(\{A, C\}, B)$. Since the gradient then is positive, a positive sign for the
situational influence will remain valid. If, on the other hand, the probability of
c increases and $Y^{-}(\{A, C\}, B)$, then a negative sign for the situational influence
will remain valid. We conclude that upon an increase of $\Pr(c)$, $\delta_1$ persists if
$\delta_1 = + \otimes \delta_2$. Otherwise, we cannot be certain of the sign of the situational
influence and $\delta_1$ is changed to '?'. Similar observations hold for a decreasing
probability of c.
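
The verification rule derived above fits in a small Python function. This is a minimal sketch: the function name `verify_situational_sign` is hypothetical, signs are encoded as the strings '+', '-', '0' and '?', and the early return for an unchanged Pr(c) (sign '0') is an assumption added here, since the rule in the text only covers the case in which the probability of c actually changes.

```python
def sign_product(a, b):
    """Sign multiplication on the signs '+', '-', '0' and '?'."""
    if a == "0" or b == "0":
        return "0"
    if a == "?" or b == "?":
        return "?"
    return "+" if a == b else "-"


def verify_situational_sign(delta1, sign_c, delta2):
    """Return the sign of the situational influence of A on B after a change in Pr(c).

    delta1: current sign of the situational influence of A on B
    sign_c: node sign of C, i.e. the direction of change of Pr(c)
    delta2: sign of the additive synergy of A and C on B
    """
    if delta1 == "?" or sign_c == "0":
        # Assumption: nothing to verify if the sign is already unknown
        # or if the probability of c did not change.
        return delta1
    # The sign persists only if delta1 equals the product of sign[C] and delta2.
    return delta1 if delta1 == sign_product(sign_c, delta2) else "?"


if __name__ == "__main__":
    # Pr(c) increases, positive synergy: a positive sign persists.
    print(verify_situational_sign("+", "+", "+"))
    # Pr(c) increases, positive synergy: a negative sign is no longer certain.
    print(verify_situational_sign("-", "+", "+"))
```

A sign that fails the check is not flipped but degraded to '?', because the rule only tells us when the current sign is guaranteed to survive.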

In our analysis so far, we have assumed that the two parents A and C of a
node B are mutually independent and remain so as evidence is entered
into the network. In general, however, A and C can be (conditionally) dependent.
Node A then not only influences B directly, but also indirectly through C. The 
situational influence of A on B, however, pertains to the direct influence in 
isolation even though a change in the probability of c may affect its sign. When 
a change in the probability of a causes a change in the probability of c which in 
turn influences the probability of b, the indirect influence on b is processed by 
the sign-propagation algorithm building upon the composition of signs. 

4.2 The Adapted Sign-Propagation Algorithm 

The sign-propagation algorithm for inference with a qualitative network has to be 
adapted to render it applicable to situational qualitative networks. In essence, 
two modifications are required. First, in case of non-monotonicities, the
algorithm must use the signs of the situational influences involved. Furthermore,
because the sign of a situational influence of a node A on a node B is dynamic, 
its validity has to be verified as soon as an observation causes a change in the 
probability distribution of another parent of B. Due to the nature of sign prop- 
agation, it may occur that a sign is propagated along a situational influence 
between A and B, while the fact that the probability distribution of another 
parent of B changes does not become apparent until later in the propagation. 
It may then turn out that the sign of the situational influence should have been 
adapted and that incorrect signs were propagated. A solution to this problem 
is to verify the validity of the sign of the situational influence as soon as infor- 
mation to this end becomes available; if the sign requires updating, inference 
is restarted with the updated network. Since the sign of a situational influence 
can change only once, the number of restarts is limited. The adapted part of the 
sign-propagation algorithm is summarised in pseudo-code in Fig. 9. 

procedure Propagate-Sign(Q, trail, to, message):
    sign[to] ← sign[to] ⊕ message;
    trail ← trail ∪ {to};
    Determine-Effect-On(Q, to);
    for each neighbour V_i of to in Q
    do linksign ← sign of influence between to and V_i;
       message ← sign[to] ⊗ linksign;
       if V_i ∉ trail and sign[V_i] ≠ sign[V_i] ⊕ message
       then Propagate-Sign(Q, trail, V_i, message).

procedure Determine-Effect-On(Q, V_i):
    N_{V_i} ← {V_j → V_k | V_j ∈ π(V_k) \ {V_i}, V_k ∈ σ(V_i), S^{?(δ_j)}(V_j, V_k), δ_j ≠ ?};
    for all V_j → V_k ∈ N_{V_i}
    do Verify-Update(S^{?(δ_j)}(V_j, V_k));
    if a δ changes
    then Q ← Q with adapted signs;
         return Process-Observation(Q, O, sign).

Fig. 9. The adapted part of the sign-propagation algorithm

Example 5. We consider the situational qualitative network from Fig. 10. The
network is identical to the one shown in Fig. 4, except that it is supplemented
with a situational sign for the non-monotonic influence of node A on node B.
Suppose that we are again interested in the effect of observing the value false
for node A upon the other nodes in the network. Inference starts with node A
receiving the message '−' and updating its node sign to 0 ⊕ − = −. Node A
subsequently determines the messages to be sent to its neighbours E, B and
D. To node E, it sends − ⊗ − = +. Upon receiving this message, node E
updates its node sign to 0 ⊕ + = + as before; node E does not propagate the
message to B. To node B, node A sends the message − ⊗ − = +, using the
sign of the situational influence. Node B updates its node sign to 0 ⊕ + = +.
It subsequently computes the message + ⊗ + = + for E. Upon receiving
this message, node E does not need to change its node sign. Node B does not
propagate the message it has received from A to node C. To node D, node A
sends − ⊗ + = −. Node D updates its node sign to 0 ⊕ − = −. It subsequently
determines the message − ⊗ + = − for node C. Upon receiving this message,
C updates its node sign to 0 ⊕ − = −. The algorithm now establishes that
node C is a parent of node B which has node A for its other parent, and that
the influence of node A on B is non-monotonic. Because the node sign of C has
changed, the validity of the sign of the situational influence of A on B needs to
be verified. Since − = − ⊗ +, the algorithm finds that the sign of the situational
influence of A on B remains valid. The inference therefore continues. Node C
sends the message − ⊗ − = + to B. Since node B need not change its node
sign, the inference halts. The node signs resulting from the inference are shown
in the network's nodes in Fig. 10. □

Examples 3 and 5 demonstrate that inference with a situational network can 
yield more informative results when compared to a regular qualitative network. 
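
To make the adapted algorithm concrete, the following Python sketch retraces Example 5. It rests on several assumptions and is not the authors' code: the arcs and signs are inferred from the messages in Examples 3 and 5, the situational sign of A on B ('-') and the synergy sign ('+') are read off from the check "− = − ⊗ +" in the example, a single observation is assumed for the blocking rule, and restarting is implemented with a hypothetical `Restart` exception.

```python
def sign_sum(a, b):
    if a == "0":
        return b
    if b == "0":
        return a
    return a if a == b else "?"


def sign_product(a, b):
    if a == "0" or b == "0":
        return "0"
    if a == "?" or b == "?":
        return "?"
    return "+" if a == b else "-"


class Restart(Exception):
    """Raised when a situational sign was degraded and inference must restart."""


def infer(arcs, situational, synergy, observed, observation):
    """arcs: (parent, child) -> sign; situational: (parent, child) -> current sign
    of a non-monotonic arc; synergy: (child, frozenset of two parents) -> sign."""
    situational = dict(situational)

    def determine_effect_on(sign, node):
        changed = False
        for (p, child) in arcs:
            if p != node:
                continue
            for (other, c2) in arcs:
                if c2 != child or other == node:
                    continue
                delta = situational.get((other, child), "?")
                if delta == "?":
                    continue
                delta2 = synergy[(child, frozenset((other, node)))]
                if delta != sign_product(sign[node], delta2):
                    situational[(other, child)] = "?"
                    changed = True
        if changed:
            raise Restart

    def visit(sign, trail, sender, node, message):
        sign[node] = sign_sum(sign[node], message)
        trail = trail | {node}
        determine_effect_on(sign, node)
        entered_from_parent = (sender, node) in arcs
        for (parent, child), linksign in arcs.items():
            if node == parent:
                neighbour = child
            elif node == child:
                if entered_from_parent:  # head-to-head node blocks the trail
                    continue
                neighbour = parent
            else:
                continue
            # Use the situational sign for non-monotonic influences.
            linksign = situational.get((parent, child), linksign)
            out = sign_product(sign[node], linksign)
            if neighbour not in trail and sign[neighbour] != sign_sum(sign[neighbour], out):
                visit(sign, trail, node, neighbour, out)

    while True:
        sign = {v: "0" for arc in arcs for v in arc}
        try:
            visit(sign, frozenset(), None, observed, observation)
            return sign, situational
        except Restart:
            continue


FIG10_ARCS = {
    ("A", "E"): "-", ("A", "B"): "?", ("B", "E"): "+",
    ("D", "A"): "+", ("D", "C"): "+", ("C", "B"): "-",
}

if __name__ == "__main__":
    signs, _ = infer(FIG10_ARCS, {("A", "B"): "-"},
                     {("B", frozenset(("A", "C"))): "+"}, "A", "-")
    print(sorted(signs.items()))
```

If the synergy sign is set to '-' instead, the check for the influence of A on B fails, its situational sign is degraded to '?', and the restarted run produces the ambiguous node signs of Example 3 again.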



Fig. 10. A situational network and its node signs after the observation A = false

5 Conclusions and Further Research

Qualitative probabilistic networks model the probabilistic relationships between 
their variables by means of signs. If such a relationship is non-monotonic, it
is assigned the ambiguous sign '?', even though the influence is always
unambiguous in the current state of the network. The presence of influences with
ambiguous signs typically leads to ambiguous, and thus uninformative, results 
upon inference. In this paper we extended qualitative networks with situational 
influences that capture qualitative information about the current effect of non- 
monotonic influences. We showed that these situational influences can be used 
upon inference and may then effectively forestall ambiguous inference results. 
Because the signs of situational influences are dynamic in nature, we identified 
conditions under which these signs retain their validity. We studied the dynamics 
of the signs of situational influences in cases where the non-monotonicity involved 
originates from a single variable. The presented ideas and methods, however, are 
readily generalised to cases where the non-monotonicity is caused by more than 
one variable. To conclude, we adapted the existing sign-propagation algorithm 
to situational qualitative networks. 

By introducing situational influences we have, in essence, strengthened the 
expressiveness of a qualitative network. Recently, other research has also focused 
on enhancing the formalism of qualitative networks, for example by introducing 
a notion of strength of influences [5]. In the future we will investigate how these
different enhancements can be integrated to arrive at an even more powerful
framework for qualitative probabilistic reasoning.

Abstract. This paper describes classification of aerial missions using
first-order discrete Hidden Markov Models based on kinematic data. Civil 
and military aerial missions imply different motion patterns as described 
by the altitude, speed and direction of the aircraft. The missions are 
transport, private flying, reconnaissance, protection from intruders in 
the national airspace as well as on the ground or the sea. A procedure 
for creating a classification model based on HMMs for this application 
is discussed. An example is presented showing how the results can be 
used and interpreted. The analysis indicates that this model can be used 
for classification of aerial missions, since there are enough differences 
between the missions and the kinematic data can be seen as observations 
from unknown elements, or states, that form a specific mission. 



1 Introduction 

An important matter in surveillance and tracking systems is target classification, 
which aims at recognising or even identifying the targets. 

Classification and tracking are fundamental for obtaining situation awareness,
i.e. understanding which situation has given rise to the observed data.
For example, in a military surveillance and tracking system it is of interest to
understand how threatening a certain situation is. Once situation awareness has been
obtained, different types of decisions can be made for the coming time period, for
example concerning sensor management.

In a multitarget environment an algorithm for automated classification will 
support the operator in the work of handling and prioritising a large number of 
targets. Furthermore, with the help of automated classification, decisions can be 
taken in a shorter time. 

Classification of aerial missions can be performed using different types of 
sensor data such as kinematic data, ESM (Electronic Support Measures) data 
and RCS (Radar Cross Section) signature data. An ESM sensor detects emissions 
from radar sensors and communication systems carried by the targets. RCS 
signatures indicate the shape of the targets. The signatures are obtained from 
analysing the intensity of reflected radar signals. Many targets have unique RCS 
signatures.

Classification models based on ESM or RCS data consist of sensor measurements
of specific aircraft and their sensors. Military aircraft, however, have an
interest in hiding their identity, and there are numerous methods
for misleading an opponent. Since new variants of equipment and new methods
for misleading are developed continuously, the classification libraries need to be
updated to keep up with the development. Kinematic data are somewhat more robust in
this respect. However, misleading with kinematic data can be performed, for
example, by flying a small military aircraft as in a transport mission or by emitting
false targets.

This paper describes a model for automated classification of civil and military 
aerial missions. The purpose is not primarily to classify the aircraft itself, but
its mission, since different missions can be performed by the same type
of aircraft. The missions analysed are transport, private flying, reconnaissance,
protection from intruders in the national airspace as well as on the ground or the 
sea. The missions imply different motion patterns in the airspace described by 
kinematic data such as flight altitude, speed and direction. The motion patterns 
are reflected in first-order discrete Hidden Markov Models (HMMs), where each 
mission is represented by a specific HMM. 

A first-order discrete HMM is a stochastic model used for time series analysis.
The system is observed through observation symbols, while the system itself
performs a random walk between states that cannot be observed. In the
case of aerial-mission classification the observation symbols reflect the aircraft
behaviour via kinematic data. The states are related to different elements of the
missions, characterised by different motion patterns.

The theory of HMM was presented in the 1960s and has, since then, been 
used for analysing various types of system to recognise certain patterns. Such 
systems include for example speech [15], word [10] and gesture recognition [9]. 

HMMs have also been used for image analysis. In [12] target classification 
using images from a SAR (Synthetic Aperture Radar) is discussed. 

In [4] the application of HMM to the recognition problem of military columns 
is discussed. The purpose is to recognise ground troop organisations using dif- 
ferent types of data such as human observations and radar sensors data. 

Different statistical methods have been investigated for the problem of air- 
craft classification. For example, in [3] HMMs are used to represent ESM data in 
the form of multiaspect electromagnetic scattering data. In [6] discrete Markov 
models are used to represent different sensor observables such as ESM data, IFF 
(Identification, friend or foe) data and aircraft speed and trajectory profiles. 
In [1] the theory of evidence is used to handle different types of sensor data, 
ambiguities and uncertainties. In [16] Bayes rule is used to update the classifi- 
cation based on different sensor observables. The model distinguishes between 
different military aircraft and is based on flight envelopes and ESM data. In [7] 
an algorithm is developed that incorporates target classification into the tar- 
get tracking process. The integration is accomplished using Bayesian network 
techniques. In doing so the process for associating new observations to existing 
tracks is improved.

2 Hidden Markov Models

In this section some of the basic equations are presented. For a more detailed
description of the different steps towards the final equations, the reader is referred
to [13].

A discrete HMM, $\lambda$, is described by a number of states N and a number
of discrete observation symbols M. The individual states are denoted
$S = \{S_1, S_2, \ldots, S_N\}$ and the individual discrete observation symbols are denoted
$V = \{v_1, v_2, \ldots, v_M\}$. The model $\lambda$ is also described by the following parameter
set:

$$\lambda = (A, B, \pi) \tag{1}$$

$$A = \{a_{ij}\} = \{P[q_{t+1} = S_j \mid q_t = S_i]\} \tag{2}$$

$$B = \{b_j(k)\} = \{P[O_t = v_k \mid q_t = S_j]\} \tag{3}$$

$$\pi = \{\pi_i\} = \{P[q_1 = S_i]\} \tag{4}$$

where A represents the state transition probabilities, B represents the
probability distributions of the discrete observation symbols and $\pi$ represents the
initial probability distribution of the states. Moreover, $a_{ij}$ denotes the
probability for the system to change from state i to state j and $b_j(k)$ denotes the
probability distribution of the symbols for a certain state j. The current state is
denoted $q_t$, the current observation is denoted $O_t$, and $1 \le i, j \le N$, $1 \le k \le M$.

The parameters A, B and $\pi$ can be estimated by HMM learning. In this
work, HMM learning is performed using the Baum-Welch method [13].

Once $\lambda$ is defined, it can be used to calculate the likelihood L of an observation
sequence O, given a specific model $\lambda$, i.e.

$$L = P(O \mid \lambda) \tag{5}$$

where O consists of observations registered at T points of time, i.e.
$O = (O_1 O_2 \cdots O_T)$. In a classification problem the purpose is to compare O to an
HMM library and to find out which $\lambda$ gives the highest L value.

To calculate L the forward variable is introduced as follows:

$$\alpha_t(i) = P(O_1 O_2 \ldots O_t, q_t = S_i \mid \lambda) . \tag{6}$$

The forward variable describes the probability of observing the partial
observation sequence $O_1 O_2 \ldots O_t$ and being in state $S_i$ at time t, given the model $\lambda$.

For the first observation $O_1$, $\alpha$ is calculated as follows:

$$\alpha_1(i) = \pi_i b_i(O_1) . \tag{7}$$

For the following observations, $\alpha$ is calculated according to:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) a_{ij} \right] b_j(O_{t+1}) \tag{8}$$

where $1 \le t \le T - 1$. The equation illustrates how $S_j$ can be reached at time
$t + 1$ from the N possible states at time t. The calculation is performed for all
states j at time t. The calculation is then iterated for $t = 1, 2, \ldots, T - 1$. The
final result is given by the sum of the terminal forward variables $\alpha_T(i)$, i.e.

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) . \tag{9}$$
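
The recursion of Eqs. 7 to 9 translates almost line by line into Python. The sketch below is a plain illustration, not the Matlab code used in the paper; the two-state, two-symbol model (`A_TOY`, `B_TOY`, `PI_TOY`) is a hypothetical toy example, and observations are given as symbol indices.

```python
def forward_likelihood(A, B, pi, observations):
    """Return P(O | lambda) for a discrete HMM.

    A[i][j]: transition probability from state i to state j
    B[j][k]: probability of symbol k in state j
    pi[i]:   initial probability of state i
    observations: sequence of symbol indices O_1 ... O_T
    """
    n = len(pi)
    # Eq. 7: initialisation with the first observation.
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    # Eq. 8: induction over the remaining observations.
    for o in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    # Eq. 9: sum of the terminal forward variables.
    return sum(alpha)


# Hypothetical toy model with two states and two symbols.
A_TOY = [[0.7, 0.3], [0.4, 0.6]]
B_TOY = [[0.9, 0.1], [0.2, 0.8]]
PI_TOY = [0.6, 0.4]

if __name__ == "__main__":
    print(forward_likelihood(A_TOY, B_TOY, PI_TOY, [0, 1]))
```

Each step costs on the order of N² multiplications, so the whole sequence is processed in time proportional to N²T rather than by enumerating all state paths.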



During the calculation procedure there are several multiplications with
probabilities well below 1.0, leading to a final result which is close to 0. Problems
arise when the result is less than the smallest number that the computer can
represent. In [13] the problem is solved by introducing the scaling factor $c_t$:

$$c_t = \frac{1}{\sum_{i=1}^{N} \alpha_t(i)} \tag{10}$$

where the forward variables at time t are computed from the already scaled
forward variables at time $t - 1$. The final equation, useful for this kind of
problem, becomes

$$\log[P(O \mid \lambda)] = -\sum_{t=1}^{T} \log c_t . \tag{11}$$

The procedure for the introduction of the scaling and for arriving at Eq. 11 can
be studied in [13].
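
The effect of the scaling can be seen in a short Python sketch that accumulates the logarithms of the scaling factors and then picks the most likely model from a library. It is an illustration under assumptions: the toy parameters and the library keys `mostly_0` and `mostly_1` are hypothetical, and the function simply fails with a division by zero if a sequence is impossible under a model.

```python
import math


def log_likelihood(A, B, pi, observations):
    """Return log P(O | lambda) using the scaled forward recursion."""
    n = len(pi)
    log_l = 0.0
    alpha = [pi[i] * B[i][observations[0]] for i in range(n)]
    for t, o in enumerate(observations):
        if t > 0:
            # Induction step on the scaled forward variables of time t - 1.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                for j in range(n)
            ]
        c = 1.0 / sum(alpha)           # Eq. 10
        alpha = [c * a for a in alpha]  # rescale so that the values sum to 1
        log_l -= math.log(c)            # Eq. 11
    return log_l


def classify(library, observations):
    """Return the name of the model with the highest log-likelihood.

    library maps a model name to a tuple (A, B, pi).
    """
    return max(library, key=lambda name: log_likelihood(*library[name], observations))


# Hypothetical toy models sharing the same transitions.
A_TOY = [[0.7, 0.3], [0.4, 0.6]]
PI_TOY = [0.6, 0.4]
LIBRARY = {
    "mostly_0": (A_TOY, [[0.9, 0.1], [0.8, 0.2]], PI_TOY),
    "mostly_1": (A_TOY, [[0.1, 0.9], [0.2, 0.8]], PI_TOY),
}

if __name__ == "__main__":
    print(classify(LIBRARY, [0, 0, 1, 0, 0]))
    # A long sequence whose unscaled likelihood would underflow to 0.
    print(log_likelihood(A_TOY, [[0.9, 0.1], [0.2, 0.8]], PI_TOY, [0, 1] * 2000))
```

Comparing log-likelihoods instead of likelihoods does not change which model wins, because the logarithm is monotonic.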



3 Application on Aerial Missions 

In [2], [8] and [11] some characteristics of aerial missions can be studied. The 
characteristics that were assumed useful for classification are presented below. 

Aircraft performing transport missions fly mainly in straight lines. They fol- 
low so-called standard flight paths. The turns are wide compared to small mil- 
itary aircraft. The cruising speed for bigger passenger aircraft ranges from 500 
to 950 km/h. The cruising level for long distance flights is around 10 km. 

The private aircraft fly mainly at low altitude and with low speed. They
probably have a greater inclination to deviate from a straight line than the transport
aircraft have. Smaller private aircraft have a maximum speed of 200 to 300 km/h
and fly at an altitude of a few hundred meters up to 3 to 5 km at the most.

Aircraft performing reconnaissance missions fly either at very low or very
high altitudes. Which altitude is used depends on the type of reconnaissance
mission. At high altitude a large area will be surveyed. At low altitude a detailed
part of the area will be surveyed. The aircraft fly mainly in straight lines and 
with low or high speed. 

Small military aircraft have a maximum speed well above 950 km/h. They
perform missions directed towards different military targets such as targets in the
air (air target mission), on the ground or the sea (ground/sea target missions).
There are different ways of performing the missions, depending on the type of
target as well as weather and geographical conditions. However, the missions
are characterised by a fast course of events, high speed, sharp turns and rapid
ascents. One major difference between the air target mission and the ground/sea
target missions is that the ground/sea target missions are performed at very low
altitude.

For military as well as non-military missions sharp turns may also be observed 
close to airports indicating landing and take-off processes. 

3.1 Multiple Target Tracking Model 

When using kinematic data for classification a first step is to process sensor data 
in a Multiple Target Tracking (MTT) model. The MTT model transforms the 
sensor data into target tracks. Once tracks are formed quantities such as target 
altitude, velocity and direction can be computed. 

The quality of the classification results is dependent not only on the HMMs
but also on the MTT model. If the target cannot be tracked well enough the
classification results will be poor and unreliable.

So far the classification model has been investigated outside the MTT
model, using Matlab. However, the next step is to fully integrate it with the
MTT model. Input data to the classification model from the MTT model will 
be a vector containing target speed in x, y and z directions, flight altitude and 
information on straight line or sharp turn. Output data from the classification 
model back to the MTT model will be a vector containing the likelihood distri- 
bution of the five possible aerial missions. The MTT model is also provided with 
a presentation part that will include the classification results for presentation to 
the operator. 

The MTT model that will serve the classification model with kinematic data
is described in [14]. Some of its basic functions are described below. For a
more detailed description of the different methods associated with target tracking
see also [5]. The target tracking is based on Kalman filtering, where sensor and
environment are modelled statistically. The sensor is represented by a
measurement equation and the movement of the target is represented by a state equation.
For each target there is an estimated state vector and an estimated covariance
matrix. The estimated state vector consists of information on target position,
speed and acceleration calculated from radar data. The covariance matrix is used
to describe the uncertainty of the state vector estimation. The purpose of the
Kalman filter is to minimise the mean squared error of the state estimate, given
the model and the measurement data.

Data association is an important but complicated procedure in a multi-target 
environment. Data association deals with the problem of pairing the observations 
with the existing tracks maintained by the MTT model. The method used is the 
GNN method (Global Nearest Neighbour). The observation that is closest to the 
predicted position of the track (according to the Kalman filter) is connected to 
the track. In the GNN method, each observation is only connected to one track. 
Observations that are not connected to any tracks can be seen as possible new 
targets for which new tracks are to be constructed. 
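
As a rough illustration of the association step, the following Python sketch
pairs predicted track positions with observations so that the total distance is
minimal and each observation is used by at most one track. It is not the
implementation of the MTT model in [14]: the function name gnn_assign, the
Euclidean distance, the brute-force search and the requirement of at least one
observation per track are assumptions made here for illustration only. A real
tracker would typically add gating and use a statistical distance based on the
Kalman filter covariance.

```python
from itertools import permutations
from math import dist


def gnn_assign(predicted, observations):
    """Assign one observation to every track, minimising the summed distance.

    predicted    -- list of predicted track positions, e.g. [(x, y), ...]
    observations -- list of observed positions, same dimensionality
    Returns (assignment, unassigned): a dict track index -> observation index
    and the list of observation indices left over (candidate new tracks).
    Brute force over permutations, so only suitable for a handful of targets.
    """
    if len(observations) < len(predicted):
        raise ValueError("sketch assumes at least one observation per track")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(observations)), len(predicted)):
        cost = sum(dist(predicted[t], observations[o]) for t, o in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    assignment = dict(enumerate(best_perm))
    unassigned = [o for o in range(len(observations)) if o not in best_perm]
    return assignment, unassigned


# Two tracks, three observations: the third observation starts a new track.
tracks = [(0.0, 0.0), (10.0, 0.0)]
obs = [(9.0, 1.0), (1.0, 0.0), (50.0, 50.0)]
assignment, new_candidates = gnn_assign(tracks, obs)
```

Here assignment maps track 0 to observation 1 and track 1 to observation 0,
while observation 2 is returned as a candidate for a new track.
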

The ability of the MTT model to track maneuvering targets is of vital
importance for which motion patterns can be described by the classification
model. The Interacting Multiple Models (IMM) method is used to represent
maneuverability. The IMM consists of different potential motion models that
are combined according to a Markov model for transitions between the motion
models. In the MTT model the IMM consists of two motion models. One
of them describes motion in a straight line (or almost a straight line). The
other describes a sharp turn. Consequently, the classification model is
restricted, in its representation of motion patterns, to a straight line and a
sharp turn.

An observation in the classification model consists of a combination of speed, 
altitude and direction. The output data from the MTT model must therefore be 
translated into observation sequences suitable for the classification model. This 
is, among other things, discussed in the following section. 



3.2 The Choice of HMM Model and Model Parameters 

The HMM library in the classification model consists of $\lambda_{tr}$
(transport), $\lambda_{pr}$ (private flying), $\lambda_{re}$ (reconnaissance),
$\lambda_{tar1}$ (air target), and $\lambda_{tar2}$ (ground/sea target).

A flight is here seen as consisting of two states ($N = 2$). Each state consists
of a specific collection of straight lines and sharp turns at different speeds and
altitudes. The states should not be confused with the two motion models in the
IMM. The motion models reflect only the direction and say nothing about speed
and altitude.

State $S_1$ represents flying the aircraft in predominantly straight lines and
with very few sharp turns. It is more frequently observed at the higher altitude
intervals. This is of course the dominating state in the transport or
reconnaissance missions but it also appears, for example, before and after a
military attack. State $S_1$ can be said to reflect regular motion patterns.
State $S_2$ represents flying the aircraft more frequently in sharp turns but
also periodically in straight lines. It is typical during the attack and
includes sharp turns and rapid ascents at high speed. The behaviour can also be
observed for military as well as non-military aircraft close to airports,
indicating landing and take-off processes, at somewhat lower speed. This
behaviour is probably more frequent at the lower altitudes. In contrast to
$S_1$, $S_2$ can be said to reflect irregular motion patterns.

The discrete symbol $v_k$ is a combination of speed, altitude and direction
according to Table 1. For this application $M = 10$, i.e.
$V = \{v_1, v_2, \ldots, v_{10}\}$. In the table $h$ represents altitude
intervals, $vel$ represents speed intervals and $m$ represents directions. The
altitude is divided into three intervals to distinguish between low and high
flying aircraft. The interval $h_1$ ranges from 0 to 0.7 km, $h_2$ ranges from
0.7 to 15 km and $h_3$ starts from 15 km with no upper limit specified.

The speed intervals are used to distinguish military aircraft with
high speed performance from other aircraft. The interval $vel_1$ ranges from 0
to 950 km/h and $vel_2$ starts from 950 km/h with no upper limit specified.

Fig. 1. Characteristic motion patterns in two dimensions for the private aircraft,
reconnaissance and ground/sea target missions

Aircraft that fly at very high altitude have no division in speed intervals. The
speed at very high altitude is denoted $vel_3$.

Finally, $m_1$ and $m_2$ are associated with the motion models in the IMM.
The direction $m_1$ represents a straight line (or almost a straight line) and
$m_2$ represents a sharp turn.

For example, $v_1$ implies an aircraft flying in a straight line, at an altitude
somewhere between 0 and 0.7 km and with a speed somewhere between 0 and
950 km/h.
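
The discretisation described above can be sketched in Python as follows. This is
only an illustration of the interval limits given in the text: the function
names, the treatment of values lying exactly on a limit, and the use of a
boolean flag for the sharp-turn indication are assumptions. The sketch returns
the triple (altitude interval, speed interval, direction); the numbering of the
ten symbols is defined by Table 1 and is deliberately not reproduced here.

```python
def altitude_interval(alt_km):
    """Map an altitude in km to h1, h2 or h3 (limits 0.7 km and 15 km)."""
    if alt_km < 0:
        raise ValueError("altitude must be non-negative")
    if alt_km < 0.7:
        return "h1"
    if alt_km < 15.0:
        return "h2"
    return "h3"


def speed_interval(speed_kmh, h):
    """Map a speed in km/h to vel1 or vel2; at h3 speed is not subdivided."""
    if h == "h3":
        return "vel3"
    return "vel1" if speed_kmh < 950.0 else "vel2"


def observation(alt_km, speed_kmh, sharp_turn):
    """Return the (h, vel, m) triple for one track update."""
    h = altitude_interval(alt_km)
    m = "m2" if sharp_turn else "m1"
    return (h, speed_interval(speed_kmh, h), m)


# Enumerate all combinations that can occur: there are exactly M = 10 of them.
distinct = {
    observation(alt, speed, turn)
    for alt in (0.3, 8.0, 16.0)
    for speed in (400.0, 1000.0)
    for turn in (False, True)
}
```

Two speed intervals and two directions at each of $h_1$ and $h_2$, plus two
directions at $h_3$, give the ten distinct symbols stated in the text.
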

Since we do not have real measurement data from different missions, the
HMM learning is based on artificial data that have been created using
information from [2], [8] and [11]. The learning process needs initial values
for $a_{ij}$, $b_j(k)$ and $\pi_i$. Initial values for $a_{ij}$ and $\pi_i$ were
created using uniform distributions. Initial values for $b_j(k)$ were estimated
using a priori information about the missions, as described above.

In total, around 100 artificial observations per mission were used as input data
to the HMM learning. Most of the training data were selected to reflect
specific characteristics of each mission. However, those parts of the missions that
represent more normal motion patterns are also represented. Figure 1 illustrates
some characteristic motion patterns in two dimensions for the private flying,
reconnaissance and ground/sea target missions. Table 2 presents $b_1(k)$ and
$b_2(k)$ obtained from HMM learning.

Table 3 presents $a_{ij}$ obtained from HMM learning. For example, the
probability for an aircraft performing a transport mission to continue to fly in
regular motion patterns is 0.72. The probability for changing to irregular
motion patterns is 0.28.

Table 1. Definition of the discrete symbols V

Table 2. Observation probabilities obtained from HMM learning

(b_j(k))_λ / k      1     2     3     4     5     6     7     8     9    10
(b_1(k))_tr       0.01  0.04  0.72  0.02  0.03  0.03  0.03  0.04  0.03  0.04
(b_2(k))_tr       0.11  0.07  0.32  0.09  0.07  0.07  0.07  0.07  0.07  0.06
(b_1(k))_pr       0.47  0.15  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05
(b_2(k))_pr       0.16  0.46  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05
(b_1(k))_re       0.37  0.03  0.04  0.04  0.34  0.03  0.04  0.04  0.04  0.04
(b_2(k))_re       0.21  0.08  0.07  0.07  0.24  0.08  0.07  0.06  0.06  0.06
(b_1(k))_tar1     0.07  0.07  0.16  0.13  0.07  0.07  0.14  0.14  0.07  0.07
(b_2(k))_tar1     0.07  0.07  0.16  0.13  0.07  0.07  0.14  0.14  0.07  0.07
(b_1(k))_tar2     0.22  0.11  0.06  0.06  0.06  0.06  0.06  0.06  0.22  0.11
(b_2(k))_tar2     0.11  0.23  0.05  0.05  0.05  0.05  0.05  0.06  0.11  0.22


Table 3. State transition probabilities obtained from HMM learning

            a_11   a_12   a_21   a_22
λ_tr        0.72   0.28   0.53   0.47
λ_pr        0.53   0.47   0.48   0.52
λ_re        0.62   0.38   0.50   0.50
λ_tar1      0.50   0.50   0.50   0.50
λ_tar2      0.45   0.55   0.55   0.45


The time aspect of the system is reflected by the A matrix. If the time period 
between observations is short, the probability for the system to change states 
will be small. On the other hand, if the time period is longer, the probability 
of changing states will be higher. In the MTT model the tracking information 
is updated once per second and consequently the classification information can 
also be updated once per second. It is however possible to vary the measurement 
time period in the MTT model. If that is utilised, it must be possible to change 
the values of A in order to correctly reflect possible changes of the states. 

4 Some Reflections on the Classification Procedure

In the following classification example some reflections on the classification pro- 
cedure will be discussed. 

In this example the target tracking system is assumed to track an aircraft
whose mission is to protect the national airspace from intruders (air target
mission). To start with, the target gives rise to the following observations
recorded during 5 seconds: altitude 8 km, speed < 950 km/h and a straight line.
These observations correspond to $v_3$ in Table 1. During the next 9 seconds the
speed is increased so that it exceeds 950 km/h ($v_7$). The target then deviates
from the straight line and performs a sharp turn during 3 seconds ($v_8$). At
the end of the observation period (3 seconds) it returns to the straight line
($v_7$). The motion pattern is illustrated by Eqs. 12 - 15.

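$$O_a = O_1\, O_2\, O_3\, O_4\, O_5 \Rightarrow (v_3\; v_3\; v_3\; v_3\; v_3) \tag{12}$$

$$O_b = O_6\, O_7\, O_8\, O_9\, O_{10} \Rightarrow (v_7\; v_7\; v_7\; v_7\; v_7) \tag{13}$$

$$O_c = O_{11}\, O_{12}\, O_{13}\, O_{14}\, O_{15} \Rightarrow (v_7\; v_7\; v_7\; v_7\; v_8) \tag{14}$$

$$O_d = O_{16}\, O_{17}\, O_{18}\, O_{19}\, O_{20} \Rightarrow (v_8\; v_8\; v_7\; v_7\; v_7) \tag{15}$$

A new likelihood calculation is performed every fifth second according to
Eqs. 16 - 19. This is to illustrate how the target unveils its characteristic
features as time goes on. The results are presented in Table 4.

$$O_{1st} = O_a \tag{16}$$

$$O_{2nd} = O_a + O_b \tag{17}$$

$$O_{3rd} = O_a + O_b + O_c \tag{18}$$

$$O_{4th} = O_a + O_b + O_c + O_d \tag{19}$$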



Table 4. Classification results for O_1st - O_4th

O        log[P(O|λ_tr)]  log[P(O|λ_pr)]  log[P(O|λ_re)]  log[P(O|λ_tar1)]  log[P(O|λ_tar2)]
O_1st         -4.8           -15.2           -15.7             -9.2             -14.5
O_2nd        -18.6           -30.4           -31.3            -19.0             -28.9
O_3rd        -32.4           -45.7           -46.9            -28.7             -43.4
O_4th        -46.3           -60.9           -62.6            -38.4             -57.8



After the first short sequence $O_{1st}$, the model suggests that the target
performs a transport mission, since $\log[P(O_{1st} \mid \lambda_{tr})]$ has the
highest value. If the number of observations is increased to $O_{2nd}$, the
likelihood for a transport mission is reduced. At this stage
$\log[P(O_{2nd} \mid \lambda_{tr})] \approx \log[P(O_{2nd} \mid \lambda_{tar1})]$.
When a sharp turn at high speed is observed ($O_{3rd}$ and $O_{4th}$), the most
likely mission is instead the air target mission.
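
The likelihood calculation itself can be sketched with the standard forward
algorithm for discrete HMMs. The Python code below is an illustration, not the
Matlab code used for the results above: the function name log_likelihood, the
scaling scheme and the uniform initial distribution pi are assumptions (the
learned values of pi are not listed in the paper). The air target model is used
as the example because its two rows in Table 2 are identical and all its
transition probabilities in Table 3 are 0.5, so the result does not depend on
the choice of pi.

```python
from math import log


def log_likelihood(obs, pi, A, B):
    """Natural-log likelihood log P(O | lambda) by the scaled forward algorithm.

    obs -- sequence of symbol indices k (1-based, as in v_1 ... v_10)
    pi  -- initial state probabilities
    A   -- A[i][j] = probability of moving from state i to state j
    B   -- B[j][k-1] = probability of emitting v_k in state j
    """
    if not obs:
        return 0.0
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0] - 1] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t] - 1]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        total += log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return total


# Air target model (lambda_tar1) as listed in Tables 2 and 3.
B_TAR1_ROW = [0.07, 0.07, 0.16, 0.13, 0.07, 0.07, 0.14, 0.14, 0.07, 0.07]
B_TAR1 = [B_TAR1_ROW, B_TAR1_ROW]
A_TAR1 = [[0.5, 0.5], [0.5, 0.5]]
PI_ASSUMED = [0.5, 0.5]  # assumption: not given in the paper

O_a = [3] * 5
O_b = [7] * 5
ll_1st = log_likelihood(O_a, PI_ASSUMED, A_TAR1, B_TAR1)
ll_2nd = log_likelihood(O_a + O_b, PI_ASSUMED, A_TAR1, B_TAR1)
```

For this model the calculation reduces to 5 ln 0.16 for $O_{1st}$ and
5 ln 0.16 + 5 ln 0.14 for $O_{2nd}$, which round to -9.2 and -19.0. This agrees
with the air target column of Table 4 if the logarithms there are natural
logarithms.
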

Table 5. Classification results for O_5th

O        log[P(O|λ_tr)]  log[P(O|λ_pr)]  log[P(O|λ_re)]  log[P(O|λ_tar1)]  log[P(O|λ_tar2)]
O_5th        -47.7           -72.3           -78.3            -51.1             -76.2




Assume that the tracking of the target is continued and that the following
is registered for 5 seconds: altitude 8 km, speed < 950 km/h and a straight
line ($v_3$). The motion pattern according to $O_e$ is added to the analysis and
the likelihood of $O_{5th}$ is calculated. The result is presented in Table 5.

$$O_e = O_{21}\, O_{22}\, O_{23}\, O_{24}\, O_{25} \Rightarrow (v_3\; v_3\; v_3\; v_3\; v_3) \tag{20}$$

$$O_{5th} = O_a + O_b + O_c + O_d + O_e \tag{21}$$

As can be seen the classification model suggests again that the most likely 
mission is the transport mission. 

According to this example there are important motion patterns that may 
be hidden if O includes too many observations. These motion patterns last for 
relatively short time periods and they represent, for example, the attacks in 
the air and ground/sea target missions. To be able to observe these specific 
motion patterns the likelihood calculations should be based on relatively few 
observations, perhaps < 5 observations at a time. The optimal length of O will 
presumably vary depending on the prevailing situation. 



5 Discussion and Further Work 

As mentioned earlier, the classification model will be integrated into the MTT
model. Running the classification model together with the MTT model will most
certainly help to suggest improvements to the classification model. For
example, is the model too coarse, and should there be more states and
observation symbols to be able to classify the aerial missions in a
satisfactory way?

Another problem to deal with is the risks for misclassifications, i.e. to classify 
a non-military mission as a military mission and vice versa. At present the model 
consists of only five missions and their representations in the HMMs are probably 
not complete. 

The behaviour of the aircraft during different parts of the missions should be
investigated more closely, so that no important parts are omitted from the
HMMs. This is also associated with the HMM learning and the problem of
obtaining input data of good quality for the learning process. It is, for
example, difficult to obtain real radar data on the different missions, and
especially on the characteristic parts of each mission.

Since this type of classification is closely related to the target tracking
process, problems associated with target tracking will also be associated with
classification. Problems in target tracking appear, for example, in
environments that include a large number of targets. In such environments it is
difficult to associate the correct observations with the correct tracks. If the
association is performed incorrectly, strange tracks will appear and the
resulting classification will be misleading.

When classifying aerial missions there is, apart from the information based
on kinematic data, another kind of characteristic associated with the missions,
namely the number of aircraft performing the same mission. For example, the
air and ground/sea target missions are usually performed by a group of aircraft.
The group performing the air target mission usually consists of a smaller number
of aircraft than the group performing the ground/sea target mission.
The transport and reconnaissance missions are usually performed by a single
aircraft. By fusing the information about the number of aircraft with the
information given by the HMMs based on kinematic data, the classification of
aerial missions could be improved further.



Abstract. This paper proposes a new version of importance sampling
propagation algorithms: dynamic importance sampling. Importance
sampling is based on using an auxiliary sampling distribution. The
performance of the algorithm depends on the variance of the weights
associated with the simulated configurations. The basic idea of dynamic
importance sampling is to use the simulation of a configuration to
modify the sampling distribution in order to improve its quality and so
reduce the variance of the future weights. The paper shows that this
can be done with little computational effort. The experiments carried
out show that the final results can be very good even when
the initial sampling distribution is far from the optimum.

Keywords: Bayesian networks, probability propagation, approxi- 
mate algorithms, importance sampling 



1 Introduction 

This paper proposes a new propagation algorithm for computing marginal con- 
ditional probabilities in Bayesian networks. It is well known that this is an NP- 
hard problem even if we only require approximate values [7]. So, we can always 
find examples in which polynomial approximate algorithms provide poor results, 
mainly if we have extreme probabilities: There is a polynomial approximate al- 
gorithm if all the probabilities are strictly greater than zero [8]. 

There are deterministic approximate algorithms [1,2,3,4,5,13,16,20,21] and 
algorithms based on Monte Carlo simulation with two main approaches: Gibbs 
sampling [12,15] and importance sampling [8,10,11,18,19,22]. 

A class of these heuristic procedures is composed of the importance sampling
algorithms based on approximate pre-computation [11,18,19]. These methods
first perform a fast but non-exact propagation, following a node removal process
[23]. In this way, an approximate 'a posteriori' distribution is obtained. In a
second stage a sample is drawn using the approximate distribution and the
probabilities are estimated according to the importance sampling methodology.

In this paper we start off with the algorithm based on approximate pre- 
computation developed in [18]. One of the particularities of that algorithm is 
the use of probability trees to represent and approximate probabilistic poten- 
tials. Probability trees have the possibility of approximating in an asymmetrical 
way, concentrating more resources (more branching) where they are more neces- 
sary: higher values with more variability (see [18] for a deeper discussion on these 
issues). However, as pointed out in [5], one of the problems of the approximate 
algorithms in Bayesian networks is that sometimes the final quality of an approx- 
imate potential will depend on all the potentials, including those which are not 
necessary to remove a variable. Imagine that we find that after deleting variable
$Z$, the result is a potential that depends on variable $X$, and we find that this
dependence is meaningful (i.e. the values of the potential are high and different
for the different cases of $X$). If there is another potential not considered at
this stage, in which all the cases of $X$ except one have been assigned a
probability equal to zero, then the discrimination on $X$ we have done when
deleting $Z$ is completely useless: in the end only one value of $X$ will be
possible. This is an extreme situation, but it illustrates that even if the
approximation is carried out in a local way, the quality of the final result
will depend on global factors. There are algorithms that take this fact into
account, such as Markov Chain Monte Carlo, the Penniless propagation method
presented in [5], and the Adaptive Importance Sampling (AIS-BN) given in [6].

In this work, we improve the algorithm proposed in [18] by allowing the
approximate potentials (the sampling distribution) to be modified on the basis
of the samples obtained during the simulation. If samples with very small
weights are drawn, the algorithm detects the part of the sampling distribution
(which is represented as an approximate probability tree) that is responsible
for this, and updates it in such a way that the same problem will not appear in
subsequent simulations. In effect, this is a way of using the samples to obtain
the information needed to improve the quality of the approximations, taking
into account other potentials in the problem. Trees are very appropriate for
this as they allow more effort to be concentrated on the parts where it is most
needed: the configurations that were more frequently obtained in past
simulations and for which the approximation was not good.

The paper is organised as follows: in Sect. 2 it is described how probability 
propagation can be carried out using the importance sampling technique. The 
new algorithm called dynamic importance sampling is described in Sect. 3. In 
Sect. 4 the performance of the new algorithm is evaluated according to the 
results of some experiments carried out in large networks with very poor initial 
approximations. The paper ends with conclusions in Sect. 5. 

2 Importance Sampling in Bayesian Networks

Throughout this paper, we will consider a Bayesian network in which
$X = \{X_1, \ldots, X_n\}$ is the set of variables and each variable $X_i$ takes
values on a finite set $\Omega_i$. If $I$ is a set of indices, we will write
$X_I$ for the set $\{X_i \mid i \in I\}$, and $\Omega_I$ will denote the
Cartesian product $\times_{i \in I} \Omega_i$. Given $x \in \Omega_I$ and
$J \subseteq I$, $x_J$ is the element of $\Omega_J$ obtained from $x$ by
dropping the coordinates not in $J$.

A potential $f$ defined on $\Omega_I$ is a mapping
$f : \Omega_I \to \mathbb{R}_0^+$, where $\mathbb{R}_0^+$ is the set of
non-negative real numbers. Probabilistic information will always be represented
by means of potentials, as in [14]. The set of indices of the variables on which a
potential $f$ is defined will be denoted as $dom(f)$.

The conditional distribution of each variable $X_i$, $i = 1, \ldots, n$, given
its parents in the network, $X_{pa(i)}$, is denoted by a potential
$p_i(x_i \mid x_{pa(i)})$ where $p_i$ is defined over
$\Omega_{\{i\} \cup pa(i)}$. If $N = \{1, \ldots, n\}$, the joint probability
distribution for the $n$-dimensional random variable $X$ can be expressed as

$$p(x) = \prod_{i \in N} p_i(x_i \mid x_{pa(i)}) \qquad \forall x \in \Omega_N . \tag{1}$$

An observation is the knowledge about the exact value $X_i = e_i$ of a variable.
The set of observations will be denoted by $e$, and called the evidence set. $E$
will be the set of indices of the variables observed.

The goal of probability propagation is to calculate the 'a posteriori'
probability function $p(x'_k \mid e)$, $x'_k \in \Omega_k$, for every
non-observed variable $X_k$, $k \in N \setminus E$. Notice that
$p(x'_k \mid e)$ is equal to $p(x'_k, e)/p(e)$, and, since
$p(e) = \sum_{x'_k \in \Omega_k} p(x'_k, e)$, we can calculate the posterior
probability if we compute the value $p(x'_k, e)$ for every $x'_k \in \Omega_k$
and normalise afterwards.

Let $H = \{p_i(x_i \mid x_{pa(i)}) \mid i = 1, \ldots, n\}$ be the set of
conditional potentials. Then, $p(x'_k, e)$ can be expressed as

$$p(x'_k, e) = \sum_{\substack{x \in \Omega_N \\ x_E = e \\ x_k = x'_k}} \prod_{i \in N} p_i(x_i \mid x_{pa(i)}) = \sum_{\substack{x \in \Omega_N \\ x_E = e \\ x_k = x'_k}} \prod_{f \in H} f(x_{dom(f)}) \tag{2}$$

If the observations are incorporated by restricting potentials in $H$ to the
observed values, i.e. by transforming each potential $f \in H$ into a potential
$f_e$ defined on $dom(f) \setminus E$ as $f_e(x) = f(y)$, where
$y_{dom(f) \setminus E} = x$, and $y_i = e_i$ for all $i \in E$, then we have,

$$p(x'_k, e) = \sum_{\substack{x \in \Omega_N \\ x_k = x'_k}} \prod_{f_e} f_e(x_{dom(f_e)}) = \sum_{\substack{x \in \Omega_N \\ x_k = x'_k}} g(x) , \tag{3}$$

where $g(x) = \prod_{f_e} f_e(x_{dom(f_e)})$.

Thus, probability propagation consists of estimating the value of the sum in
(3), and here is where the importance sampling technique is used. Importance
sampling is well known as a variance reduction technique for estimating integrals
by means of Monte Carlo methods (see, for instance, [17]), consisting of
transforming the sum in (3) into an expected value that can be estimated as a
sample mean. To achieve this, consider a probability function
$p^* : \Omega_N \to [0, 1]$, verifying that $p^*(x) > 0$ for every point
$x \in \Omega_N$ such that $g(x) > 0$. Then formula (3) can be written as

$$p(x'_k, e) = \sum_{\substack{x \in \Omega_N,\; g(x) > 0 \\ x_k = x'_k}} \frac{g(x)}{p^*(x)}\, p^*(x) = E\left[ \frac{g(X^*)}{p^*(X^*)} \right] ,$$

where $X^*$ is a random variable with distribution $p^*$ (from now on, $p^*$
will be called the sampling distribution). Then, if $\{x^{(j)}\}_{j=1}^{m}$ is a
sample of size $m$ taken from $p^*$,

$$\hat{p}(x'_k, e) = \frac{1}{m} \sum_{j=1}^{m} \frac{g(x^{(j)})}{p^*(x^{(j)})} \tag{4}$$

is an unbiased estimator of $p(x'_k, e)$ with variance

$$\mathrm{Var}(\hat{p}(x'_k, e)) = \frac{1}{m} \left( \sum_{\substack{x \in \Omega_N \\ x_k = x'_k}} \frac{g^2(x)}{p^*(x)} - p^2(x'_k, e) \right) .$$

The value $w_j = g(x^{(j)})/p^*(x^{(j)})$ is called the weight of configuration
$x^{(j)}$.

Minimising the error in unbiased estimation is equivalent to minimising the
variance. As formulated above, importance sampling requires a different sample
to estimate each one of the values $x'_k$ of $X_k$. However, in [18] it was shown
that it is possible to use a unique sample to estimate all the values $x'_k$. In
such a case, the minimum variance is reached when the sampling distribution
$p^*(x)$ is proportional to $g(x)$, and is equal to:

$$\mathrm{Var}(\hat{p}(x'_k \mid e)) = \frac{1}{m}\, p(x'_k \mid e)\,(1 - p(x'_k \mid e)) .$$

This provides very good estimations depending on the value of $m$ (analogously
to the estimation of binomial probabilities from a sample), but it has the
difficulty that it is necessary to handle $p(x \mid e)$, the distribution for
which we want to compute the marginals. Thus, in practical situations the best
we can do is to obtain a sampling distribution as close as possible to the
optimal one.

Once $p^*$ is selected, $p(x'_k, e)$ for each value $x'_k$ of each variable
$X_k$, $k \in N \setminus E$ can be estimated with the following algorithm:

Importance Sampling

1. For $j := 1$ to $m$ (sample size)

   a) Generate a configuration $x^{(j)} \in \Omega_N$ using $p^*$.

   b) Calculate the weight:

      $$w_j := \frac{\prod_{f_e} f_e(x^{(j)}_{dom(f_e)})}{p^*(x^{(j)})} \tag{5}$$

2. For each $x'_k \in \Omega_k$, $k \in N \setminus E$, estimate $p(x'_k, e)$ as
   the average of the weights in formula (5) corresponding to configurations
   containing $x'_k$.

3. Normalise the values $p(x'_k, e)$ in order to obtain $p(x'_k \mid e)$.
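
The three steps above can be sketched in Python on a toy problem. Everything
specific in this sketch is an assumption made for illustration: the network
A -> B, A -> C with evidence B = 1, its probabilities, the deliberately poor
uniform sampling distribution, and all function names. The weights containing
each value are summed and then normalised; the common factor 1/m is omitted
because it cancels in step 3.

```python
import random

# Toy network (made-up numbers): A -> B and A -> C, evidence B = 1.
P_A = {0: 0.7, 1: 0.3}
P_B1_GIVEN_A = {0: 0.1, 1: 0.8}           # p(B = 1 | A = a), already restricted
P_C_GIVEN_A = {0: {0: 0.8, 1: 0.2},       # p(C = c | A = a)
               1: {0: 0.1, 1: 0.9}}


def g(x):
    """Product of the evidence-restricted potentials for configuration x."""
    a, c = x["A"], x["C"]
    return P_A[a] * P_B1_GIVEN_A[a] * P_C_GIVEN_A[a][c]


def sample_uniform(rng):
    """Draw a configuration of the unobserved variables from a uniform p*."""
    return {"A": rng.randrange(2), "C": rng.randrange(2)}


def p_star(x):
    """Probability of x under the uniform sampling distribution."""
    return 0.25


def importance_sampling(g, sample, p_star, m, rng):
    """Estimate p(x_k | e) for every unobserved variable from one sample."""
    sums = {}
    for _ in range(m):
        x = sample(rng)                   # step 1a
        w = g(x) / p_star(x)              # step 1b, formula (5)
        for var, val in x.items():
            per_var = sums.setdefault(var, {})
            per_var[val] = per_var.get(val, 0.0) + w
    posterior = {}
    for var, per_var in sums.items():     # steps 2 and 3
        total = sum(per_var.values())
        if total == 0.0:
            raise ValueError("all weights are zero; p* is unusable")
        posterior[var] = {val: s / total for val, s in per_var.items()}
    return posterior


estimate = importance_sampling(g, sample_uniform, p_star, 20000, random.Random(0))
exact_a1 = 0.3 * 0.8 / (0.7 * 0.1 + 0.3 * 0.8)   # p(A = 1 | B = 1) by hand
```

With a sample of this size the estimate of p(A = 1 | B = 1) should lie close to
the exact value 0.24/0.31, even though the uniform $p^*$ is far from
proportional to $g$.
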

The sampling distribution for each variable can be obtained through a process
of eliminating variables in the set of potentials $H$. An elimination order
$\sigma$ is considered and variables are deleted according to such order:
$X_{\sigma(1)}, \ldots, X_{\sigma(n)}$.

The deletion of a variable $X_{\sigma(i)}$ consists of marginalising it out from
the combination of all the functions in $H$ which are defined for that variable.
More precisely, the steps are as follows:

- Let $H_{\sigma(i)} = \{f \in H \mid \sigma(i) \in dom(f)\}$.

- Calculate $f_{\sigma(i)} = \prod_{f \in H_{\sigma(i)}} f$ and
  $f'_{\sigma(i)}$ defined on $dom(f_{\sigma(i)}) \setminus \{\sigma(i)\}$, by
  $f'_{\sigma(i)}(y) = \sum_{x_{\sigma(i)}} f_{\sigma(i)}(y, x_{\sigma(i)})$.

- Transform $H$ into $H \setminus H_{\sigma(i)} \cup \{f'_{\sigma(i)}\}$.
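
The deletion step can be sketched in Python with potentials stored as plain
tables. This is an exact version without any approximation or pruning, and the
representation (a pair of a variable-name tuple and a dictionary from value
tuples to numbers) as well as the names combine, sum_out and delete_variable are
assumptions chosen for brevity; the algorithms discussed in this paper use
probability trees instead.

```python
from itertools import product


def combine(f, h, domains):
    """Pointwise product of two potentials."""
    fv, ft = f
    hv, ht = h
    variables = tuple(dict.fromkeys(fv + hv))       # ordered union
    table = {}
    for vals in product(*(domains[v] for v in variables)):
        x = dict(zip(variables, vals))
        table[vals] = ft[tuple(x[v] for v in fv)] * ht[tuple(x[v] for v in hv)]
    return variables, table


def sum_out(f, var):
    """Marginalise var out of potential f."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        key = tuple(val for v, val in zip(fv, vals) if v != var)
        table[key] = table.get(key, 0.0) + p
    return keep, table


def delete_variable(H, var, domains):
    """Return (new H, combined potential) after deleting var from H."""
    H_var = [f for f in H if var in f[0]]
    if not H_var:
        raise ValueError("no potential is defined for " + var)
    rest = [f for f in H if var not in f[0]]
    f_comb = H_var[0]
    for f in H_var[1:]:
        f_comb = combine(f_comb, f, domains)
    return rest + [sum_out(f_comb, var)], f_comb


# Toy example (made-up numbers): potentials p(A) and p(B | A).
domains = {"A": [0, 1], "B": [0, 1]}
p_a = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_given_a = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
H_after, f_comb = delete_variable([p_a, p_b_given_a], "A", domains)
```

Deleting A leaves a single potential over B, the marginal p(B); the combined
potential f_comb is the one that would later be restricted and used to simulate
a value for A.
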

Simulation is carried out in the reverse of the order in which variables are
deleted. To obtain a value for $X_{\sigma(i)}$ we will use the function
$f_{\sigma(i)}$ obtained in the deletion of this variable. This potential is
defined for the values of variable $X_{\sigma(i)}$ and other variables already
sampled. Potential $f_{\sigma(i)}$ is restricted to the already obtained values
of variables in $dom(f_{\sigma(i)}) \setminus \{\sigma(i)\}$, giving rise to a
function which depends only on $X_{\sigma(i)}$. Finally, a value for this
variable is obtained with probability proportional to the values of this
potential. If all the computations are exact, it was proved in [11] that we are
really sampling with the optimal probability $p^*(x) = p(x \mid e)$. But the
result of the combinations in the process of obtaining the sampling
distributions may require a large amount of space to be stored, and therefore
approximations are usually employed, either using probability tables [11] or
probability trees [18] to represent the distributions. Instead of computing the
exact potentials we calculate approximate ones but with far fewer values. Then
the deletion algorithm is faster and the potentials need less space. But the
sampling distribution is not the optimal one and the quality of the estimations
will depend on the quality of the approximations.

In [11] an alternative procedure to compute the sampling distribution was
used. Instead of restricting $f_{\sigma(i)}$ to the values of the variables
already sampled, all the functions in $H_{\sigma(i)}$ are restricted, resulting
in a set of functions depending only on $X_{\sigma(i)}$. The sampling
distribution is then computed by multiplying all these vectors. If computations
are exact, then both distributions are the same, as restriction and combination
commute. When the combinations are not exact, generally the option of
restricting $f_{\sigma(i)}$ is faster and the restriction of functions in
$H_{\sigma(i)}$ is more accurate, as there is no need to approximate in the
combination of functions depending only on one variable.

3 Dynamic Importance Sampling 

Dynamic importance sampling follows the same general structure as our previous
importance sampling algorithms, but with the difference that sampling
distributions can change each time that a configuration $x^{(j)}$ is simulated.
The algorithm follows the option of restricting the functions in
$H_{\sigma(i)}$ before combining the functions to obtain the distribution for
$X_{\sigma(i)}$.

Assume that we have already simulated the values
$c_i^j = (x^{(j)}_{\sigma(n)}, \ldots, x^{(j)}_{\sigma(i+1)})$ and that we are
going to simulate a value $x^{(j)}_{\sigma(i)}$ for $X_{\sigma(i)}$. Let us
denote by $f_{c_i^j}$ the result of restricting the potential $f$ to the values
of $c_i^j$, and let $f'_{\sigma(i)}$ be the function that was computed when
removing variable $X_{\sigma(i)}$ in the elimination algorithm. Note that this
function is contained in $H$. For the simulation of value
$x^{(j)}_{\sigma(i)}$ compute the following elements:

- $(H_{\sigma(i)})_{c_i^j} = \{f_{c_i^j} \mid f \in H_{\sigma(i)}\}$, the
  result of restricting all the functions in $H_{\sigma(i)}$ to the values
  already simulated.

- $q_{\sigma(i)}$: the result of the combination of all the functions in
  $(H_{\sigma(i)})_{c_i^j}$. This is a vector depending only on variable
  $X_{\sigma(i)}$. A case for this variable is obtained by simulation with
  probability proportional to the values of this vector.

- $b_{\sigma(i)} = \sum_{x_{\sigma(i)}} q_{\sigma(i)}(x_{\sigma(i)})$ (the
  normalisation value of vector $q_{\sigma(i)}$).

- $a_{\sigma(i)}$, equal to the value of potential $f'_{\sigma(i)}$ when
  instantiated for the cases in vector $c_i^j$.

If all the computations are exact (i.e. the trees representing the potentials
have not been pruned during the variable elimination phase), then
$b_{\sigma(i)}$ must be equal to $a_{\sigma(i)}$. $b_{\sigma(i)}$ is obtained by
restricting to $c_i^j = (x^{(j)}_{\sigma(n)}, \ldots, x^{(j)}_{\sigma(i+1)})$
the potentials in $H_{\sigma(i)}$, combining them, and summing out variable
$X_{\sigma(i)}$; while $a_{\sigma(i)}$ is the result of combining potentials in
$H_{\sigma(i)}$, summing out $X_{\sigma(i)}$, and restricting the result to
$c_i^j$. This is clear if we notice that $f'_{\sigma(i)}$ is the result of
the combination of potentials in $H_{\sigma(i)}$ and then summing out
$X_{\sigma(i)}$. In other words, we do the same operations but with the
difference that restriction to configuration $c_i^j$ is done at the beginning
for $b_{\sigma(i)}$ and at the end for $a_{\sigma(i)}$. If the computation had
been exact the results should be the same, but these operations do not commute
if the potentials involved have been previously pruned. $b_{\sigma(i)}$ is the
correct value and $a_{\sigma(i)}$ the value that can be found in potential
$f'_{\sigma(i)}$. This potential is the one that has been used to compute the
sampling probabilities of variables
$(X_{\sigma(n)}, \ldots, X_{\sigma(i+1)})$. Therefore, if $b_{\sigma(i)}$ and
$a_{\sigma(i)}$ are very different, it means that configuration $c_i^j$ is
being drawn with a probability of occurrence far away from its actual value.
The worst case is when $a_{\sigma(i)}$ is much greater than $b_{\sigma(i)}$.
For example, imagine in an extreme situation that $b_{\sigma(i)}$ is equal to
zero and $a_{\sigma(i)}$ is a large value. Then we would be obtaining, with high
probability, a configuration that should never be drawn (its real probability
is zero)^1. This is very bad, because the weights of all these configurations
will be zero and they will be completely useless. If instead of zero the real
probability had been very small, we would have a similar situation, but now the
weights would be very small, and the real impact of these configurations on the
final estimation would be very small as well. Summing up, we would be doing a
lot of work with very little reward.

^1 If we had stored in $f'_{\sigma(i)}$ the exact value (zero), then, as this
value is used to simulate the values of
$(X_{\sigma(n)}, \ldots, X_{\sigma(i+1)})$, the probability of this
configuration should have been zero.

Dynamic importance sampling computes the minimum of the values
$a_{\sigma(i)}/b_{\sigma(i)}$ and $b_{\sigma(i)}/a_{\sigma(i)}$, considering
that this minimum is one if $a_{\sigma(i)} = 0$. If this value is less than a
given threshold, then potential $f'_{\sigma(i)}$ is updated to the exact value
$b_{\sigma(i)}$ for the given configuration
$c_i^j = (x^{(j)}_{\sigma(n)}, \ldots, x^{(j)}_{\sigma(i+1)})$. This
potential will be used in the next simulations, and thus $c_i^j$ will be drawn
with a more accurate probability in the future.
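
The test that triggers an update can be written as a small Python function. The
name needs_update is hypothetical, and two readings are assumed here: that the
convention "the minimum is one" applies when the stored value a is zero, and
that the threshold is a positive number below one. When the exact value b is
zero and a is positive, the ratio b/a is zero, so an update is always requested.

```python
def needs_update(a, b, threshold):
    """Decide whether the stored value a differs too much from the exact value b.

    a         -- value found in the approximate potential for the configuration
    b         -- exact normalisation value computed during the simulation
    threshold -- assumed to satisfy 0 < threshold <= 1
    """
    if a == 0.0:
        return False              # assumed convention: the minimum is taken as one
    ratio = 0.0 if b == 0.0 else min(a / b, b / a)
    return ratio < threshold


examples = [
    needs_update(0.4, 0.6, 0.8),  # ratio about 0.67: update
    needs_update(0.4, 0.4, 0.8),  # ratio 1: no update
    needs_update(0.5, 0.0, 0.1),  # exact value is zero: update
]
```
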

The updating of the potential is not simply to change the value a£,(i) by the 
new value The reason is that a single value on a tree affects to more than 
one configuration (if the branch corresponding to that configuration has been 
pruned) and then we may be changing the values of other configurations different 
to c^. If = 0, we could even introduce zeros where the real exact value is 
positive, thus violating the basic property of importance sampling which says 
that any possible configuration must have a chance to be drawn. In order to keep 
this property, we must branch the tree representing in such a way that we 
do not change its value for configurations for which &o-(i) is not necessarily the 
actual value. Therefore, the basic problem is to determine a subset of variables 
. . . , for which we have to branch the node of the tree associated 

to so that only those leaves corresponding to the values of these variables 

in c( are changed to the new value. 

The first step is to consider the subset of active variables, $A_{\sigma(i)}$,
associated with the potential. This set is computed during the variable
elimination phase. Initially, $A_{\sigma(i)}$ is the union of all the domains of
the potentials involved minus $X_{\sigma(i)}$. But if a variable, $X_j$, can be
pruned without error from the potential (i.e. for every configuration of the
other variables, the potential is constant on the values of this variable) and
all the potentials containing this variable have been calculated in an exact
way (all the previous computations have only involved pruning without error),
then $X_j$ can be removed from $A_{\sigma(i)}$. Though this may seem at first
glance a situation unlikely to appear in real examples, it happens for all the
variables for which there are no observed descendants [18]. All these variables
can be deleted in an exact way by pruning the result to the constant tree with
value 1.0, and this provides an important initial simplification.

Taking $A_{\sigma(i)}$ as basis, we consider the tree representing the potential
and follow the path corresponding to the configuration (selecting for each
variable in a node the child corresponding to the value in the configuration)
until we reach a leaf. Let $B_{\sigma(i)}$ be the set of all the variables in
$A_{\sigma(i)}$ which are not in the branch of the tree leading to the leaf
node, $L$, that we have reached. The updating is carried out according to the
following recursive procedure:

Procedure Update

1. If $B_{\sigma(i)} = \emptyset$,
2.     Assign the value $b_{\sigma(i)}$ to leaf $L$
3. Else
4.     Select a variable $Y \in B_{\sigma(i)}$
5.     Remove $Y$ from $B_{\sigma(i)}$
6.     Branch $L$ by $Y$
7.     For each possible value $y$ of $Y$
8.         If $y$ is not the value of $Y$ in the configuration
9.             Make the child corresponding to $y$ be a leaf with value $a_{\sigma(i)}$
10.        Else
11.            Let $L_y$ be the child corresponding to value $y$
12.            Call Update on $L_y$

Fig. 1. Example of tree updating




In this algorithm, branching a node by a variable $Y$ consists of transforming
it into an interior node with a child for each one of the values of the
variable. Consider the case of Fig. 1, in which we have arrived at the leaf on
the left with a value of $a_{\sigma(i)} = 0.4$, the variables in
$B_{\sigma(i)}$ are $X$, $Y$ and $Z$, each one of them taking values in
$\{0, 1\}$, and the values of these variables in the current configuration are
1, 0 and 1 respectively. Suppose that we have to update the value of this
configuration in the tree to the new value $b_{\sigma(i)} = 0.6$. The result is
the tree on the right side of the figure.
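The recursive updating procedure can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the Elvira implementation: the `Node` class, the `update` and `lookup` functions and their parameters are hypothetical names, variables are branched in the order given, and each pending variable's domain is passed explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A probability-tree node: a leaf holds a value, an interior node a variable."""
    value: Optional[float] = None
    variable: Optional[str] = None
    children: Dict[int, "Node"] = field(default_factory=dict)


def update(leaf: Node, a: float, b: float, config: Dict[str, int],
           remaining: List[str], domains: Dict[str, List[int]]) -> None:
    """Branch `leaf` recursively so that only `config` receives the new value b."""
    if not remaining:
        leaf.value = b              # no variables left: assign the exact value
        return
    y_var, rest = remaining[0], remaining[1:]
    leaf.value = None               # the leaf becomes an interior node on y_var
    leaf.variable = y_var
    for y in domains[y_var]:
        child = Node(value=a)       # off-path children keep the old value a
        leaf.children[y] = child
        if y == config[y_var]:
            update(child, a, b, config, rest, domains)


def lookup(node: Node, config: Dict[str, int]) -> Optional[float]:
    """Follow the path selected by `config` down to a leaf and return its value."""
    while node.variable is not None:
        node = node.children[config[node.variable]]
    return node.value
```

Applied to the example of Fig. 1 (a leaf with value 0.4, pending variables X, Y, Z and configuration X = 1, Y = 0, Z = 1), the call `update(Node(value=0.4), 0.4, 0.6, {"X": 1, "Y": 0, "Z": 1}, ["X", "Y", "Z"], domains)` leaves every configuration at 0.4 except the current one, which is set to 0.6.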

Although with this algorithm the different configurations in the sample are
dependent, the resulting estimator is unbiased: each individual simulation is
unbiased (we never introduce a zero value when the true probability is
different from zero) and expectation is always additive.

4 Experimental Evaluation of the New Algorithm 

The performance of the new algorithm has been evaluated by means of sev- 
eral experiments carried out over two large real-world Bayesian networks. The 
two networks are called pedigree4 (441 variables) and munin2 (1003 variables). 
The networks have been borrowed from the Decision Support Systems group at 
Aalborg University (Denmark) (www.cs.auc.dk/research/DSS/misc.html). 

The dynamic importance sampling algorithm, denoted by dynamic is, has been
compared with importance sampling without this feature (is), using the same
implementation as in [18]. The new algorithm has been implemented in Java, and
included in the Elvira shell (leo.ugr.es/~elvira) [9].

Our purpose is to show that dynamic is can have a good performance even in 
the case that initial approximations are very poor. Thus, in the computation of 
the sampling distributions we have carried out a very rough approximation: In 
all of the experiments the maximum potential size has been set to 20 values, and 
the threshold for pruning the probability trees has been set to ε = 0.4. This
value of ε indicates that the numbers in a set of leaves of the tree whose
difference (in terms of entropy) with respect to a uniform distribution is less
than 40% are replaced by their average (see [18]). This is a very poor
approximation and implies that it is possible to obtain configurations with
very low weights with high probability, which will give rise to a high variance
of the estimator.

The experiments we have carried out consist of 20 consecutive applications 
of the dynamic is algorithm. The first application uses the approximate poten- 
tials computed when deleting variables. We consider a threshold to update the 
potentials of 0.95 (see Sect. 3). In each subsequent application of the algorithm 
we start with the potentials updated in the previous application. In this way, we 
expect to have better sampling distributions each time. 

The sample size in each application is very small (50 configurations). We 
have chosen such a small sample size in order to appreciate the evolution of 
the accuracy of the sampling distributions in each of the 20 applications of the 
algorithm. The behaviour of the dynamic algorithm is so good that, with a
larger sample (for instance, 2000 configurations), the differences among the 20
runs of the algorithm would not be significant, because already in the first
sample the algorithm is able to find sampling distributions very close to the
optimal.

The accuracy of the estimated probability values is measured as the mean 
squared error (denoted by MSE in Fig. 2). Due to the small sample size the 
variance of the errors is high and therefore we have repeated the series of ap- 
plications a high number of times, computing the average of the errors in all of 
them to reduce the differences due to randomness. 

The experiments have been carried out on a Pentium 4, 2.4 GHz computer, 
with 1.5 GB of RAM and operating system Suse Linux 8.1. The Java virtual 
machine used was Java 2 version 1.4.1. The results of the experiments are re- 
ported in Fig. 2 where the error (MSE) is represented as a function of the number 
of applications of the dynamic is algorithm (from 1 to 20). The horizontal line 
is the optimum error: the error that is obtained when the optimum sampling 
distribution is used (the variable elimination phase is carried out without ap- 
proximations) and with the same parameters as dynamic is, i.e. sample size 50, 
maximum potential size 20. 

The accuracy of the is algorithm described in [18] is far from the accuracy of
dynamic is. With similar computing times, the MSE values for is are 0.22 with
the pedigree4 network and 0.14 with the munin2 network, whilst the worst errors
reached by dynamic is are 0.045 and 0.034 respectively.

Fig. 2. Evolution of the error in the munin2 (left) and pedigree4 (right)
networks



4.1 Results Discussion 

The experiments show that even with a very bad initial sampling distribution, 
dynamic is updates the approximate potentials towards potentials with a perfor- 
mance close to the exact ones, just after simulating a few configurations. The 
updating is very fast at the beginning, but afterwards the improvement is very 
slow. This fact agrees with the results of experiments reported in [20], in
which it is shown that in general the mass of probability is concentrated in a
few configurations. When the sampling probability is updated for these
configurations, the performance is good. To achieve the accuracy of the exact
distribution we need to update a lot of configurations with little mass of
probability. This is a slow process. We have observed that initially the
updating of a potential is very frequent, but after a few iterations it seldom
occurs. Another important fact is that updating is propagated: If we update a
potential, this new potential will be the one that will appear associated with the 
variables that are deleted afterwards. Then, the new potential will be the one 
considered when the condition for updating is evaluated. This usually gives rise 
to new updates. 

The updating of potentials does not entail a significant increase in time. The
dynamic algorithm is slower than is during the first iterations, but very
quickly it becomes faster, as the sampling distributions are more accurate and
the updating procedure is rarely called. In fact, the only important additional
step is the restriction of the potentials and their combination. The
restriction of each one of the potentials has a complexity proportional to the
number of variables in it. As the resulting potentials depend only on variable
$X_{\sigma(i)}$, the complexity of combination is proportional to the number of
cases of this variable.

5 Conclusions 

We have introduced a modification of importance sampling algorithms for
probabilistic propagation in Bayesian networks, consisting of the updating of
the sampling distribution taking as basis the configurations we are obtaining
during the simulation. This makes it possible, with little additional time, to
obtain good quality sampling distributions even if the initial ones are bad.
Dynamic (or adaptive) sampling algorithms are not new within the context of
Bayesian networks. Perhaps the best-known case is AIS-BN [6]. However, the use
of probability trees makes the convergence much faster (in the experiments in
[6] thousands of configurations are considered).

In the future, we plan to modify the dynamic is algorithm to carry out the
updating in a first stage, changing to is afterwards. For this, we should
determine the point at which updating no longer provides benefit because it
occurs very rarely, for configurations of little probability which will
therefore appear on very few occasions afterwards. But perhaps the most
important study will be to evaluate to what extent it is worth making more
effort in the initial approximation, or whether it is better to make a very bad
approximation at the beginning, leaving to the updating phase the
responsibility of computing better sampling distributions. The results of our
experiments indicate that surely the second option will be better, but more
extensive experiments comparing both options will be necessary to give a
better-founded answer.

Abstract. The Hugin and Shenoy-Shafer architectures are two variations
on the jointree algorithm, which exhibit different tradeoffs with
respect to efficiency and query answering power. The Hugin architecture 
is more time-efficient on arbitrary jointrees, avoiding some redundant 
computations performed by the Shenoy-Shafer architecture. This effi- 
ciency, however, comes at the price of limiting the number of queries the 
Hugin architecture is capable of answering. In this paper, we present a 
simple algorithm which retains the efficiency of the Hugin architecture 
and enjoys the query answering power of the Shenoy-Shafer architecture. 



1 Introduction 

There are a number of algorithms for answering queries with respect to Bayesian 
networks. Among the most popular of these are the algorithms based on jointrees, 
of which the Shenoy-Shafer [10] and Hugin [4,3] architectures represent two 
prominent variations. While superficially similar, these architectures differ in 
both efficiency and query answering power. Specifically, the Hugin architecture is 
faster on arbitrary jointrees, but the Shenoy-Shafer architecture results in more 
information and can answer more queries. This paper presents an architecture 
that combines the best of both algorithms. In particular, we show that a simple 
modification to the Hugin architecture which does not alter its time and space 
efficiency, allows it to attain the same query answering power exhibited by the 
Shenoy-Shafer. 

This paper is structured as follows. Section 2 reviews the definition of join- 
trees, and details the Shenoy-Shafer and Hugin architectures. Section 3 intro- 
duces the main idea of the paper, and provides a corresponding algorithm which 
can be thought of as either a more efficient Shenoy-Shafer architecture or a more 
expressive Hugin architecture. It also details the semantics of messages and data 
maintained by the new algorithm. Section 4 closes the paper with some conclud- 
ing remarks. Proofs of all theorems are delegated to the appendix. 

2 Jointree Algorithms 

We review the basics of jointrees and jointree algorithms in this section. Let
$B$ be a belief network. A jointree for $B$ is a pair $(T, C)$, where $T$ is a
tree and $C$ is a function that assigns a label $C_i$ to each node $i$ in tree
$T$. A jointree must satisfy three properties: (1) each label $C_i$ is a set of
variables in the belief network; (2) each network variable $X$ and its parents
$U$ (a family) must appear together in some label $C_i$; (3) if a variable
appears in the labels of nodes $i$ and $j$ in the jointree, it must also appear
in the label of each node $k$ on the path connecting them. The label of edge
$ij$ in tree $T$ is defined as $S_{ij} = C_i \cap C_j$. The nodes of a jointree
and their labels are called clusters. Moreover, the edges of a jointree and
their labels are called separators. The width of a jointree is defined as the
number of variables in its largest cluster minus 1.

Jointree algorithms start by constructing a jointree for a given belief network
[10,4,3]. They associate tables (also called potentials) with clusters and
separators.¹ The conditional probability table (CPT) of each variable $X$ with
parents $U$, denoted $\theta_{X|U}$, is assigned to a cluster that contains $X$
and $U$. In addition, a table over each variable $X$, denoted $\lambda_X$ and
called an evidence table, is assigned to a cluster that contains $X$. Evidence
$e$ is entered into a jointree by initializing evidence tables as follows: we
set $\lambda_X(x)$ to 1 if $x$ is consistent with evidence $e$, and we set
$\lambda_X(x)$ to 0 otherwise.
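The initialization of evidence tables can be illustrated with a short sketch. The function name `evidence_table` and the dictionary-based representation of tables and evidence are assumptions made for illustration only.

```python
from typing import Dict, List


def evidence_table(variable: str, values: List[str],
                   evidence: Dict[str, str]) -> Dict[str, float]:
    """Return the evidence table of `variable`: 1 for consistent values, 0 otherwise."""
    observed = evidence.get(variable)   # None when the variable is unobserved
    return {v: 1.0 if observed is None or v == observed else 0.0 for v in values}
```

An unobserved variable is consistent with every one of its values, so its evidence table contains only 1s.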

Given some evidence $e$, a jointree algorithm propagates messages between
clusters. After passing two messages per edge in the jointree, one can compute
the marginals $\Pr(C, e)$ for every cluster $C$. There are two main methods for
propagating messages in a jointree, known as the Shenoy-Shafer [10] and the
Hugin [4] architectures, which we review next (see [5,6] for a thorough
introduction to each architecture).

2.1 The Shenoy-Shafer Architecture

Shenoy-Shafer propagation proceeds as follows. First, evidence e is entered into 
the jointree through evidence indicators. A cluster is then selected as the root and 
message propagation proceeds in two phases, inward and outward. In the inward 
phase, messages are passed toward the root. In the outward phase, messages are 
passed away from the root. Cluster i sends a message to cluster j only when it 
has received messages from all its other neighbors k. A message from cluster i 
to cluster j is a table defined as follows: 

$$M_{ij} = \sum_{C_i \setminus S_{ij}} \Phi_i \prod_{k \neq j} M_{ki}, \qquad (1)$$

where $\Phi_i$ is the product of CPTs and evidence tables assigned to cluster $i$.

Once message propagation is finished in the Shenoy-Shafer architecture, we 
have the following for each cluster i in the jointree: 

$$\Pr(C_i, e) = \Phi_i \prod_{k} M_{ki}. \qquad (2)$$

¹ A table is an array which is indexed by variable instantiations.
Specifically, a table $\phi$ over variables $X$ is indexed by the
instantiations $x$ of $X$. Its entries $\phi(x)$ are in [0, 1]. We assume
familiarity with table operations, such as multiplication, division and
marginalization [3].

Let us now look at the time and space requirements of the Shenoy-Shafer
architecture. The space requirements are simply those needed to store the
messages computed by Equation 1. That is, we need two tables for each separator
$S_{ij}$: one table stores the message from cluster $i$ to cluster $j$, and the
other stores the message from $j$ to $i$. We will assume in our time analysis
below the availability of the table $\Phi_i$, which represents the product of
all CPT and evidence tables assigned to cluster $i$. This is meant to simplify
our time analysis, but we stress that one of the attractive aspects of the
Shenoy-Shafer architecture is that one can afford to keep this table in
factored form, therefore avoiding the need to allocate space for this table,
which may be significant.

As for time requirements, suppose that we have a jointree with $n$ clusters and
width $w$. Suppose further that the table $\Phi_i$ is already available for each
cluster $i$, and let us bound the amount of work performed by the inward and
outward passes of the Shenoy-Shafer architecture, i.e., the work needed to
evaluate Equations 1 and 2. We first note that for each cluster $i$, Equation 1
has to be evaluated $n_i$ times and Equation 2 has to be evaluated once, where
$n_i$ is the number of neighbors for cluster $i$. Each evaluation of Equation 1
leads to multiplying $n_i$ tables, whose variables are all in cluster $C_i$.
Moreover, each evaluation of Equation 2 leads to multiplying $n_i + 1$ tables,
whose variables are also all in cluster $C_i$. The total complexity is then:

$$\sum_i O\big(n_i (n_i - 1) \exp(|C_i|) + n_i \exp(|C_i|)\big),$$

(since multiplying $n$ elements requires $n - 1$ multiplications) which reduces
to $\sum_i O(n_i^2 \exp(w))$, where $w$ is the jointree width. This further
reduces to $O(\alpha \exp(w))$, where $\alpha = \sum_i n_i^2$ is a term that
ranges from $O(n)$ to $O(n^2)$ depending on the jointree structure. For example,
we may have what is known as a binary jointree in which each cluster has at most
three neighbors, leading to $\alpha \le 6(n - 1)$. Or we may have a jointree
with one cluster having the other $n - 1$ clusters as its neighbors, leading to
$\alpha = n^2 - n$.
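The two extremes of the structural term can be checked numerically with a small sketch. It takes the term to be the sum of squared neighbor counts, as in the analysis above; the helper names `alpha`, `star` and `chain` and the adjacency-list representation are hypothetical, and a chain is used as one simple instance of a binary jointree.

```python
from typing import Dict, List


def alpha(neighbors: Dict[int, List[int]]) -> int:
    """Sum of squared neighbor counts over all clusters of a jointree."""
    return sum(len(adj) ** 2 for adj in neighbors.values())


def star(n: int) -> Dict[int, List[int]]:
    """Cluster 0 is adjacent to each of the other n - 1 clusters."""
    tree = {0: list(range(1, n))}
    tree.update({i: [0] for i in range(1, n)})
    return tree


def chain(n: int) -> Dict[int, List[int]]:
    """A path of n clusters; no cluster has more than three neighbors."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
```

For n = 10 clusters, `alpha(star(10))` equals 10² - 10 = 90, while `alpha(chain(10))` stays below the bound 6(n - 1) = 54.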

Given a Bayesian network with $n$ variables, and an elimination order of width
$w$, we can construct a binary jointree for the network with the following
properties: the jointree has width $\le w$, and no more than $2n - 1$ clusters.
Hence, we can avoid the quadratic complexity suggested above by a careful
construction of the jointree, although this can dramatically increase the space
requirements.



2.2 The Hugin Architecture 

Fig. 1. A simplified jointree message example

We now discuss the Hugin architecture, which tends to take less time but uses
more space. We first consider Fig. 1 which provides an abstraction of the
difference between the Hugin and Shenoy-Shafer architectures. Here, node $r$
has neighbors $1, \ldots, n$, where each edge between $r$ and its neighbor $i$
is labeled with a number $x_i$. Node $r$ is also labeled with a number $x_r$.
Suppose now that node $r$ wants to send a message $m_i$ to each of its
neighbors $i$, where the content of this message is:
$m_i = x_r \prod_{j \neq i} x_j$. One way to do this is to compute the above
product for each neighbor $i$. This is the approach taken by the Shenoy-Shafer
architecture and leads to a quadratic complexity in the number of neighbors
$n$. Alternatively, we can compute the product $p = x_r \prod_{j=1}^{n} x_j$
only once, and then use it to compute the message to each neighbor $i$ as
$m_i = p/x_i$. This is the approach taken by the Hugin architecture. It is
clearly more efficient as it only requires one division for each message, while
the first method requires $n$ multiplications per message. However, it requires
that $x_i \neq 0$, otherwise, $p/x_i$ is not defined. But if the message is
going to be later multiplied by an expression of the form $x_i \alpha$, then we
can define $p/0$ to be 0, or any other number for that matter, and our
computations will be correct since $(x_r \prod_{j \neq i} x_j) x_i \alpha = 0$
regardless of the message value. This is exactly what happens in the Hugin
architecture when computing joint marginals and, hence, the division by zero
does not pose a problem. Yet, for some other queries which we discuss later,
the quantity $\prod_{i=1}^{n} x_i$ is needed when $x_r = 0$, in which case it
cannot be recovered by dividing $p$ by $x_r$. Moreover, as we see next, since
the Hugin architecture does not save the numbers $x_i$, it cannot compute the
product $\prod_{i=1}^{n} x_i$ through an explicit multiplication of the terms
appearing in this product. This is basically the main difference between the
Hugin and Shenoy-Shafer architectures except that the above analysis is applied
to tables instead of numbers.
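The scalar abstraction above can be made concrete with a short sketch contrasting the two strategies. The function names are hypothetical, `math.prod` requires Python 3.8 or later, and the convention that p/0 is 0 follows the text.

```python
from math import prod
from typing import List


def messages_by_multiplication(x_r: float, xs: List[float]) -> List[float]:
    """m_i = x_r * prod_{j != i} x_j, recomputed from scratch for every neighbor."""
    return [x_r * prod(x for j, x in enumerate(xs) if j != i)
            for i in range(len(xs))]


def messages_by_division(x_r: float, xs: List[float]) -> List[float]:
    """m_i = p / x_i with p computed once; p / 0 is defined as 0."""
    p = x_r * prod(xs)
    return [p / x if x != 0 else 0.0 for x in xs]
```

With no zeros among the edge labels the two functions agree. With labels 2, 0, 5 and node label 3, the message to the second neighbor is 30 under explicit multiplication but 0 under division, which is exactly the information lost by dividing.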

Hugin propagation proceeds similarly to Shenoy-Shafer by entering evidence $e$
using evidence tables; selecting a cluster as root; and propagating messages in
two phases, inward and outward. The Hugin method, however, differs in some
major ways. First, it maintains a table $\Phi_{ij}$ with each separator, whose
entries are initialized to 1s. It also maintains a table $\Phi_i$ with each
cluster $i$, initialized to the product of all CPTs and evidence tables assigned
to cluster $i$; see Fig. 2.

Cluster i passes a message to neighboring cluster j only when i has received 
messages from all its other neighbors k. When cluster i is ready to send a message 
to cluster j, it does the following: 

- it saves the separator table $\Phi_{ij}$ into $\Phi_{ij}^{old}$

- it computes a new separator table $\Phi_{ij} = \sum_{C_i \setminus S_{ij}} \Phi_i$

- it computes a message to cluster $j$: $M_{ij} = \Phi_{ij} / \Phi_{ij}^{old}$

- it multiplies the computed message into the table of cluster $j$: $\Phi_j = \Phi_j M_{ij}$

After the inward and outward passes of Hugin propagation are completed, we have
the following for each cluster $i$ in the jointree: $\Pr(C_i, e) = \Phi_i$. The
space requirements for the Hugin architecture are those needed to store cluster
and separator tables: one table for each cluster and one table for each
separator. Note that the Hugin architecture does not save the message
$M_{ij} = \Phi_{ij} / \Phi_{ij}^{old}$ sent from cluster $i$ to cluster $j$.

As for time requirements, suppose that we have a jointree with $n$ clusters and
width $w$. Suppose further that the initial tables $\Phi_i$ and $\Phi_{ij}$ are
already available for each cluster $i$ and separator $ij$. Let us now bound the
amount of work performed by the inward and outward passes of the Hugin
architecture, i.e., the work needed to pass a message from each cluster $i$ to
each of its neighbors $j$. Saving the old separator table takes
$O(\exp(|S_{ij}|))$; computing the message takes
$O(\exp(|C_i|) + \exp(|S_{ij}|))$, and multiplying the message into the table of
cluster $j$ takes $O(\exp(|C_j|))$. Hence, if each cluster $i$ has $n_i$
neighbors, the total complexity is:

$$\sum_i \sum_j O\big(\exp(|C_i|) + 2\exp(|S_{ij}|) + \exp(|C_j|)\big),$$

which reduces to $O(n \exp(w))$, where $w$ is the jointree width. Note that this
result holds regardless of the jointree structure. Hence, the linear complexity
in $n$ is obtained for any jointree, without a need to use a special jointree
as in the Shenoy-Shafer architecture.

2.3 Beyond Joint Marginals 

The Hugin architecture gains efficiency over the Shenoy-Shafer architecture on 
arbitrary jointrees by employing division. Moreover, although the use of division 
does not prevent the architecture from producing joint marginals, it does prevent 
it from producing answers to some other queries which are useful for a variety 
of applications including sensitivity analysis [1], local optimization problems like 
parameter learning [9], and MAP approximation [7]. 

To explain these additional queries, suppose that we just finished jointree
propagation using evidence $e$. This gives us the probability of evidence $e$,
since for any cluster $C$, we have $\Pr(e) = \sum_C \Pr(C, e)$. Suppose now that
we need the probability of some new evidence which results from erasing the
value of variable $X$ from $e$, denoted $e - X$. More generally, suppose that we
need the probability of evidence $e - X, x$, where $x$ is a value of variable
$X$ which is different from the one appearing in evidence $e$. Both of these
probabilities can be obtained locally using the Shenoy-Shafer architecture
without further propagation, but cannot in general be computed locally using
the Hugin architecture (see [2] for some special cases). The other type of query
which falls in this category is that of computing the derivative
$\partial \Pr(e) / \partial \theta_{x|u}$ of the likelihood $\Pr(e)$ with
respect to a network parameter $\theta_{x|u}$. We will now show how these
queries can be answered locally using the Shenoy-Shafer architecture. We later
show how they can be computed using the modification we suggest to the Hugin
architecture.

To compute the probabilities $\Pr(e - X, x)$ for a variable $X$ using the
Shenoy-Shafer architecture, we first need to identify the cluster $i$ which
contains the evidence table $\lambda_X$. The probabilities $\Pr(e - X, x)$, for
each value $x$, are then available in the following table which is defined over
variable $X$:

$$\sum_{C_i \setminus X} \prod_k \phi_k \prod_j M_{ji}.$$

Here, $\phi_k$ ranges over all CPTs and evidence tables assigned to cluster $i$,
excluding the evidence table $\lambda_X$ [2,8].

Similarly, to compute the derivatives $\partial \Pr(e) / \partial \theta_{x|u}$,
we need to identify the cluster $i$ which is assigned the CPT of variable $X$.
The derivatives for all instantiations $xu$ of variable $X$ and its parents $U$
are then available in the following table, which is defined over family $XU$:

$$\sum_{C_i \setminus XU} \prod_k \phi_k \prod_j M_{ji}.$$

Here, $\phi_k$ ranges over all CPTs and evidence tables assigned to cluster $i$,
excluding the CPT $\theta_{X|U}$ [8].

Hugin is not able to handle these queries in general because it does not save
messages that are exchanged between clusters, and because table division may
lead to a division by zero (see [2] for a special case where Hugin can handle
some of these queries).

3 Getting the Best of Both Worlds 

We now present a jointree propagation algorithm that combines the query an- 
swering power of Shenoy-Shafer propagation with the efficiency of Hugin prop- 
agation. The messages sent between clusters are the same as Shenoy-Shafer 
messages, but the tables stored at each cluster represent the product of assigned 
tables and incoming messages in a manner similar to the Hugin approach. 

As discussed earlier, division is a key to Hugin efficiency, but it also
produces a loss of information. The problem is that multiplication by zero is
noninvertible. In Sect. 3.1 we discuss the problem in detail and introduce a
simple and efficient technique to circumvent it. In Sect. 3.2 we describe the
new propagation algorithm. Section 3.3 details the semantics of messages in the
new architecture, as well as the content of cluster and separator tables.

3.1 Handling Zeros 

This section introduces the notion of a zero conscious number: a pair $(z, b)$,
where $z$ is a scalar and $b$ is a bit. It also defines various operations on
these numbers. We then show in the following section that by employing such
numbers in a variation on the Hugin architecture, we can attain the same query
answering power exhibited by the Shenoy-Shafer architecture.

To motivate zero conscious numbers, consider a set of numbers
$x_1, \ldots, x_n$ and suppose that for each $i = 1, \ldots, n$, our goal is to
compute the product $m_i = \prod_{j \neq i} x_j$. We distinguish between three
cases:

Case 1: $x_1, \ldots, x_n$ contain no zeros. Then $m_i = p/x_i$, where
$p = \prod_j x_j$.

Case 2: $x_1, \ldots, x_n$ contain a single zero $x_k$. Then
$m_k = \prod_{j \neq k} x_j$ and $m_i = 0$ for all $i \neq k$.

Case 3: $x_1, \ldots, x_n$ contain more than one zero. Then $m_i = 0$ for all
$i$.

Note that in Case 3, we have $m_k = 0 = \prod_{j \neq k} x_j$ for every zero
$x_k$, since we have more than one zero. Hence, Case 2 and Case 3 can be merged
together. Using these cases, the messages $m_i$ can be computed efficiently by
first computing a pair $(z, b)$ such that:

- $b$ is a bit which indicates whether any of the elements $x_i$ is a zero
  (Cases 2,3).

- $z$ is the product of all elements $x_i$, excluding the single zero if one
  exists.

For example, if the elements $x_i$ are 1, 2, 3, 4, 5, we would compute
$(1 * 2 * 3 * 4 * 5, f) = (120, f)$ since no zero was withheld. For elements
1, 2, 0, 4, 5, we would compute $(40, t)$. Finally, for elements 1, 2, 0, 4, 0,
we would compute $(0, t)$.
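Computing the pair from a list of elements can be sketched as follows. The function name is hypothetical, and the bit is represented by a Python boolean (`True` for t, `False` for f).

```python
from typing import Iterable, Tuple


def zero_conscious_product(xs: Iterable[float]) -> Tuple[float, bool]:
    """Return (z, b): b flags a withheld zero, z is the product of the rest."""
    z, b = 1.0, False
    for x in xs:
        if x == 0 and not b:
            b = True      # withhold the first zero instead of multiplying it in
        else:
            z *= x        # any further zero is multiplied in and drives z to 0
    return z, b
```

The three examples in the text correspond to the inputs `[1, 2, 3, 4, 5]`, `[1, 2, 0, 4, 5]` and `[1, 2, 0, 4, 0]`.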

Then each message $m_i$ can be computed from the pair $(z, b)$ as follows:

$$m_i = \begin{cases} z/x_i & \text{if } b = f \\ 0 & \text{if } b = t \text{ and } x_i \neq 0 \\ z & \text{if } b = t \text{ and } x_i = 0 \end{cases}$$

This can be thought of as dividing the pair $(z, b)$ by $x_i$. In fact, we will
call the pair $(z, b)$ a zero conscious number and define division by a scalar
as given above. Two more operations on zero conscious numbers will be needed. In
particular, multiplying a zero conscious number $(z, b)$ by a scalar $c$ is
defined as:

$$(z, b) * c = \begin{cases} (z, t) & \text{if } b = f \text{ and } c = 0 \\ (c * z, b) & \text{otherwise.} \end{cases}$$

Moreover, the addition of two zero conscious numbers is defined as:

$$(z_1, b_1) + (z_2, b_2) = \begin{cases} (z_1, b_1) & \text{if } b_1 = f \text{ and } b_2 = t \\ (z_2, b_2) & \text{if } b_1 = t \text{ and } b_2 = f \\ (z_1 + z_2, b_1) & \text{otherwise.} \end{cases}$$

Finally, we define

$$real(z, b) = \begin{cases} z & \text{if } b = f \\ 0 & \text{otherwise.} \end{cases}$$

Note that if the zero conscious number $(z, b)$ was computed for elements
$x_1, \ldots, x_n$, then $real(z, b)$ recovers the product $\prod_j x_j$ of these
elements.
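The four operations on zero conscious numbers can be collected in a small class. This is a sketch only: the class name `ZC` and its method names are hypothetical, and the bit is represented by a boolean.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZC:
    """A zero conscious number (z, b); b is True when one zero factor is withheld."""
    z: float
    b: bool = False

    def times(self, c: float) -> "ZC":
        """Multiply by a scalar, withholding the first zero factor."""
        if not self.b and c == 0:
            return ZC(self.z, True)
        return ZC(c * self.z, self.b)

    def divided_by(self, x: float) -> float:
        """Divide by a scalar x that was one of the factors multiplied in."""
        if not self.b:
            return self.z / x      # no zero factor, so x is nonzero
        return self.z if x == 0 else 0.0

    def plus(self, other: "ZC") -> "ZC":
        """Add two zero conscious numbers."""
        if not self.b and other.b:
            return self
        if self.b and not other.b:
            return other
        return ZC(self.z + other.z, self.b)

    def real(self) -> float:
        """Recover the ordinary product of all factors."""
        return 0.0 if self.b else self.z
```

Multiplying `ZC(1.0)` successively by 1, 2, 0, 4, 5 yields `ZC(40.0, True)`; dividing that number by each element in turn gives the five messages, of which only the one for the zero element is nonzero.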

We can also define zero conscious tables which map variable instantiations to
zero conscious numbers. Now let $\Psi$ be a zero conscious table and $\Phi$ be a
standard table. The marginal $\sum_X \Psi$, product $\Psi\Phi$, and division
$\Psi/\Phi$ can then be defined in the obvious way. Moreover, $real(\Psi)$ is
defined as a standard table which results from applying the real operation to
each entry of $\Psi$.

3.2 Algorithmic Description 

The algorithm we propose is very similar to Hugin propagation. Like the Hugin
architecture, it maintains a table $\Phi_{ij}$ for each separator $ij$. A table
is also maintained for each cluster. Unlike for Hugin, the table $\Psi_i$
associated with cluster $i$ is a zero conscious table. From here on, we will use
$\Psi$ to denote a zero conscious table and $\Phi$ to denote a standard table.



Initialization. The separator table entries are initialized to 1, and the
cluster table entries are initialized to $(1, f)$. The CPTs and evidence tables
assigned to a cluster are multiplied into the corresponding cluster table.



Message Propagation. This algorithm requires the same obedience to message 
ordering that the other jointree algorithms do. That is, messages are sent from 
the leaves, toward some root, then back from the root to the leaves. A message 
from cluster i to cluster j is computed as follows: 

'I'temp ^ 



Figure 3 illustrates an example of this propagation scheme. This algorithm 
basically mirrors Hugin propagation, but with zero conscious tables for clusters. 
It has some minor time and space overhead over what the Hugin algorithm 
requires. The time overhead consists of an additional logical test per operation. 
The storage requirement is also fairly insignificant. Single precision floating point 
numbers require 32 bits and double precision numbers require 64 bits, so an extra 
bit (or even byte, if the processor can’t efficiently manipulate bits) per number 
will increase the space requirements only slightly. 

3.3 The Semantics 

The message passing semantics is the same as for Shenoy-Shafer.

Fig. 3. Zero conscious propagation illustrated on a simple jointree under
evidence $b$, where the left cluster is root. The jointree is for network
$C \leftarrow A \rightarrow B$, where $\theta_a = .6$, $\theta_{b|a} = .2$,
$\theta_{b|\bar{a}} = .7$, $\theta_{c|a} = 1$, and $\theta_{c|\bar{a}} = .5$



Theorem 1. The message passed from cluster $i$ to cluster $j$ is the same as the
message passed using Shenoy-Shafer propagation. That is, if $\phi_{ij}$ is the
product of all tables assigned to clusters on the $i$-side of edge $ij$, and if
$X$ are the variables appearing in these tables, then the message
$M_{ij} = \sum_{X \setminus S_{ij}} \phi_{ij}$.

The cluster and separator table semantics closely resemble the corresponding 
Hugin semantics. 

Theorem 2. After all the messages have been passed, $real(\Psi_i) = \Pr(C_i, e)$
and $\Phi_{ij} = \Pr(S_{ij}, e)$ for all neighboring clusters $i$ and $j$.

Although very similar to the Hugin semantics, the difference in the cluster 
table makes these semantics significantly more powerful. 

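Theorem 3. Let $\phi_{i1}, \ldots, \phi_{in}$ be the CPTs and evidence tables
assigned to cluster $i$, and let $X_m$ be the set of variables of $\phi_{im}$.
Then after message passing is complete,

$$\sum_{C_i \setminus X_k} \prod_{m \neq k} \phi_{im} \prod_j M_{ji} = \Big( \sum_{C_i \setminus X_k} \Psi_i \Big) / \phi_{ik}.$$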

This theorem shows that we can use zero conscious division to perform the same
local computations permitted by the Shenoy-Shafer architecture. Consider Fig. 3,
which depicts an example of zero conscious propagation under evidence $e = b$.
Given the table associated with the left cluster, we have
$\Pr(e) = .12 + .14 + .14 = .4$. Suppose now that we want to compute the
probability of $(e - B, \bar{b}) = \bar{b}$. We can do this by identifying the
cluster $AB$ which contains the evidence table $\lambda_B$ and then computing
$(\sum_A \Psi_{AB}) / \lambda_B$. This leads to the first division shown in
Fig. 4, showing that the probability of $\bar{b}$ is .6.

Similarly, to compute the partial derivatives of Pr(e) with respect to
parameters $\theta_{C|A}$, we need $\Psi_{AC} / \theta_{C|A}$, which is also shown
in Fig. 4. According to this computation, for example, we have
$\partial \Pr(e) / \partial \theta_{c|\bar{a}} = .28$. Finally, the partial
derivatives $\partial \Pr(e) / \partial \theta_A$ are obtained from
$(\sum_{C} \Psi_{AC}) / \theta_A$, which is also shown in Fig. 4. According to
this computation, $\partial \Pr(e) / \partial \theta_{\bar{a}} = .7$.

Fig. 4. Evidence retraction and partial derivative operations for the jointree in Fig. 3
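The numbers quoted in this example can be cross-checked by brute-force enumeration over the three-variable network. The sketch below does not implement jointree propagation or zero conscious division; it only recomputes Pr(e), the probability after retracting b, and the two derivatives directly from the parameters, which are assumed to be those of the Fig. 3 caption. The names `params`, `pr_evidence` and `derivative` are illustrative.

```python
# Parameters of the network C <- A -> B, keyed by the state of A (True = a).
# Values are assumed from the Fig. 3 caption; complements are filled in.
params = {
    "theta_a": {True: 0.6, False: 0.4},
    "theta_b": {True: 0.2, False: 0.7},                  # Pr(b | A)
    "theta_c": {(True, True): 1.0, (False, True): 0.0,   # key = (state of C, state of A)
                (True, False): 0.5, (False, False): 0.5},
}

def pr_evidence(theta_a, theta_b, theta_c):
    """Pr(e) for e = b, summing the joint over the states of A and C."""
    return sum(theta_a[a] * theta_b[a] * theta_c[(c, a)]
               for a in (True, False) for c in (True, False))

def derivative(table, key):
    """Partial derivative of Pr(e) with respect to one parameter.

    Pr(e) is linear in each individual parameter, so the derivative is the
    difference between setting that parameter to 1 and setting it to 0.
    """
    def with_value(v):
        changed = dict(params)
        changed[table] = {**params[table], key: v}
        return pr_evidence(**changed)
    return with_value(1.0) - with_value(0.0)

pr_e = pr_evidence(**params)

# Retract the evidence on B and assert b-bar: swap Pr(b | A) for Pr(b-bar | A).
flipped = {a: 1.0 - p for a, p in params["theta_b"].items()}
pr_not_b = pr_evidence(params["theta_a"], flipped, params["theta_c"])

d_theta_c = derivative("theta_c", (True, False))   # w.r.t. theta_{c | a-bar}
d_theta_a = derivative("theta_a", False)           # w.r.t. theta_{a-bar}

print(round(pr_e, 2), round(pr_not_b, 2), round(d_theta_c, 2), round(d_theta_a, 2))
```

Under these assumptions the four values agree with .4, .6, .28 and .7 from the text, up to floating-point rounding.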

4 Conclusion 

We proposed a combination of the Shenoy-Shafer and Hugin architectures, in
which we use zero conscious tables/potentials. The use of these tables provides
a simple way to exploit the efficiency of the Hugin method, while extending the
set of queries that can be answered efficiently. For the price of a single bit per
cluster entry, and some minimal logic operations, all queries answerable using
Shenoy-Shafer propagation can now be answered using Hugin type operations.
For applications that require more than just marginal probabilities, such as local
search methods for MAP and sensitivity analysis, this can produce a significant
speedup over the use of the Shenoy-Shafer architecture.

A Proof of Theorems

We first introduce a few lemmas about zero conscious potentials that we will 
need to prove the theorems. 

Lemma 1. Let $\Psi$ be the zero conscious product of potentials
$\phi_1, \ldots, \phi_n$. Then $\Psi / \phi_i = \prod_{j \neq i} \phi_j$.

Proof. Consider an entry $\psi$ of $\Psi$, and the unique compatible entry
$\phi_i$ from each table. Then, based on the division property of zero conscious
numbers, $\psi / \phi_i = \prod_{j \neq i} \phi_j$. This is true for all entries
$\psi$, thus proving the result.

Lemma 2. Let c be a factor of zero conscious numbers $a = (n_a, z_a)$ and
$b = (n_b, z_b)$. Then $(a + b)/c = a/c + b/c$.

Proof. We simply break it into cases and show that the definitions of the
operations require that they agree.

Case 1 ($c = 0$). Then $z_a = z_b = t$. Thus $(a + b)/c = (n_a + n_b, t)/c =
n_a + n_b = a/c + b/c$ since $a/c = n_a$ and $b/c = n_b$.

Case 2 ($c \neq 0$, $z_a = f$, $z_b = f$). Then $(a + b)/c = (n_a + n_b, f)/c =
(n_a + n_b)/c = n_a/c + n_b/c = a/c + b/c$.

Case 3 ($c \neq 0$, $z_a = f$, $z_b = t$). Then $(a + b)/c = a/c = a/c + b/c$
since $b/c = 0$.

Case 4 ($c \neq 0$, $z_a = t$, $z_b = f$). Then $(a + b)/c = b/c = a/c + b/c$
since $a/c = 0$.

Case 5 ($c \neq 0$, $z_a = t$, $z_b = t$). Then $(a + b)/c = 0 = a/c + b/c$.
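The case analysis can be replayed mechanically. The sketch below is an assumption-laden reading of the operations as they are used in the proof: a zero conscious number stores the product of its nonzero factors together with a flag saying whether a zero factor is present, and division by a factor returns an ordinary number. The class `ZC` and the functions `zc_add` and `zc_div` are illustrative names, not the paper's definitions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZC:
    """Zero conscious number: n is the product of the nonzero factors,
    z is True when a zero factor is also present (the real value is then 0)."""
    n: float
    z: bool

def zc_add(a, b):
    if a.z == b.z:            # both flagged or both unflagged: add the stored parts
        return ZC(a.n + b.n, a.z)
    return b if a.z else a    # exactly one summand is really zero: keep the other

def zc_div(a, c):
    """Divide a by one of its factors c, returning an ordinary number."""
    if c == 0:
        return a.n            # cancelling the zero factor leaves the stored part
    return 0.0 if a.z else a.n / c

# One instance per case of the proof: (a, b, c).
cases = [
    (ZC(3.0, True),  ZC(5.0, True),   0.0),  # Case 1: c = 0, both flagged
    (ZC(6.0, False), ZC(10.0, False), 2.0),  # Case 2
    (ZC(6.0, False), ZC(10.0, True),  2.0),  # Case 3
    (ZC(6.0, True),  ZC(10.0, False), 2.0),  # Case 4
    (ZC(6.0, True),  ZC(10.0, True),  2.0),  # Case 5
]
results = [(zc_div(zc_add(a, b), c), zc_div(a, c) + zc_div(b, c))
           for a, b, c in cases]
```

Each pair in `results` holds the left-hand and right-hand side of the lemma for one case; under the stated reading the two sides coincide in all five.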

Lemma 3. Let $\Psi$ be the zero conscious product of potentials
$\phi_1, \ldots, \phi_n$. Then
$(\sum_{C \setminus S} \Psi) / \phi_i = \sum_{C \setminus S} (\Psi / \phi_i)$,
where C are the variables of $\Psi$, and S are the variables of $\phi_i$.

Proof. Consider the instances c of C compatible with instance s of S. Repeated
application of Lemma 2 implies that summing them, then dividing is the same
as dividing them, then summing. This is true for each element, and so for the
table as a whole.

Proof of Theorem 1

We will first prove the theorem for messages sent towards the root. We will then 
prove it for messages away from the root. 

Consider a leaf node i sending a message to its neighbor j. The cluster
potential $\Psi_i$ consists of the product of all tables
$\phi_{i1}, \ldots, \phi_{in}$ assigned to it. The message sent is
$(\sum_{C \setminus S} \Psi_i)/1 = \sum_{C \setminus S} \prod_k \phi_{ik}$,
which is the Shenoy-Shafer message from a leaf node.

Now, assume by way of induction that all messages toward the root have been
received for cluster i. The cluster potential $\Psi_i$ contains the product of
the assigned tables $\phi_{i1}, \ldots, \phi_{in}$ and the incoming upward
messages $M_{ki}$ from neighbors k. Then the upward message is
$(\sum_{C \setminus S} \Psi_i)/1 = \sum_{C \setminus S} \prod_m \phi_{im} \prod_k M_{ki}$,
which again is the Shenoy-Shafer message.

So, for the upward pass the messages sent equal the corresponding
Shenoy-Shafer messages.

Now, consider a message sent from the root r. The cluster potential $\Psi_r$
consists of the product of the assigned tables, and the incoming messages of all
neighbors. The message sent to neighbor j is
$(\sum_{C \setminus S} \Psi_r) / M_{jr}$, which by application of Lemma 3 followed
by Lemma 1 yields $\sum_{C \setminus S} \prod_m \phi_{rm} \prod_{i \neq j} M_{ir}$,
which is again the appropriate Shenoy-Shafer message.

Now, assume by way of induction that a cluster i has received messages which
equal the corresponding Shenoy-Shafer messages from each of its neighbors.
Then, the message sent to neighbor j away from the root is
$(\sum_{C \setminus S} \Psi_i) / M_{ji}$, which again appealing to Lemmas 3 and 1
equals $\sum_{C \setminus S} \prod_m \phi_{im} \prod_{k \neq j} M_{ki}$, which is
the same as the Shenoy-Shafer message.

Proof of Theorem 2 

After propagation completes, $\Psi_i$ contains the product of the locally
assigned tables and the incoming messages. Thus
$real(\Psi_i) = \prod_k \phi_{ik} \prod_j M_{ji}$. Since the messages are the same
as the Shenoy-Shafer messages, and in the Shenoy-Shafer architecture
$\Pr(C_i, e) = \prod_k \phi_{ik} \prod_j M_{ji}$, we get $real(\Psi_i) = \Pr(C_i, e)$.

For separator ij, where i is closer to the root, after propagation completes,
$\Phi_{ij} = real(\sum_{C_i \setminus S_{ij}} \Psi_i) = \sum_{C_i \setminus S_{ij}} real(\Psi_i)
= \sum_{C_i \setminus S_{ij}} \Pr(C_i, e) = \Pr(S_{ij}, e)$.

Proof of Theorem 3 

After propagation completes, $\Psi_i$ consists of the product of the tables
$\phi_{i1}, \ldots, \phi_{in}$ assigned to cluster i, and the incoming messages
$M_{ji}$. Then
$(\sum_{C_i \setminus X_k} \Psi_i) / \phi_{ik} = \sum_{C_i \setminus X_k} \prod_{m \neq k} \phi_{im} \prod_j M_{ji}$,
which contains the partial derivatives of Pr(e) with respect to the parameters
of $\phi_{ik}$ [8].

Abstract. The problem of efficient characterization of inclusion
neighbourhood is crucial for some methods of learning (equivalence classes of)
Bayesian networks. In this paper, neighbouring equivalence classes of a
given equivalence class of Bayesian networks are characterized efficiently
by means of the respective essential graph. The characterization reveals
hidden internal structure of the inclusion neighbourhood. More exactly,
upper neighbours, that is, those neighbouring equivalence classes which
describe more independencies, are completely characterized here. First,
every upper neighbour is characterized by a pair $([a, b], C)$ where $[a, b]$
is an edge in the essential graph and $C \subseteq N \setminus \{a, b\}$ is a
disjoint set of nodes. Second, if $[a, b]$ is fixed, the class of sets C which
characterize the respective neighbours is a tuft of sets determined by its
least set and the list of its maximal sets. These sets can be read directly
from the essential graph. An analogous characterization of lower neighbours,
which is more complex, is mentioned.



1 Motivation 

1.1 Learning Bayesian Networks 

Several approaches to learning Bayesian networks use the method of
maximization of a quality criterion, also named ’quality measure’ [3] and
’score metric’ [4]. A quality criterion is a function, designed by a
statistician, which ascribes to data and a network a real number which
’evaluates’ how suitable the statistical model determined by the network is to
explain the occurrence of the data. Since the actual aim of the learning
procedure is to get a statistical model (defined by a network), reasonable
quality criteria do not distinguish between equivalent Bayesian networks, that
is, between networks which define the same statistical model. Therefore, from
an operational point of view, the goal is to learn an equivalence class of
Bayesian networks, that is, a class of acyclic directed graphs (over
a fixed set of nodes N).

As direct maximization of a quality criterion is typically infeasible, the
method of local search is often used. The main idea of this approach is that a
suitable concept of neighbourhood is introduced for acyclic directed graphs
over N. The point is that the change in the value of a (reasonable) quality
criterion is easy to compute for neighbouring graphs. Thus, instead of global
maximization of a quality criterion, one searches for a local maximum of the
criterion with respect to the considered neighbourhood structure, and this task
is usually computationally feasible. Typical neighbourhood structures used in
practice are defined by means of simple graphical operations with considered
graphs - for details see [5,7].

The algorithms of this kind can also be classified according to the method of 
representation of equivalence classes of networks. In some algorithms, an equiv- 
alence class is represented by any of its members which may, however, result 
in computational complications. In other algorithms, a special representative of 
each equivalence class is used. The most popular representative of an equivalence 
class of Bayesian networks is the essential graph which is a certain chain graph 
describing some common features of acyclic directed graphs from the class. The 
term ’essential graph’ was proposed by Andersson, Madigan and Perlman [1]; 
alternative names ’completed pattern’, ’maximally oriented graph for a pattern’
and ’completed pdag’ have also appeared in the literature. 

1.2 Inclusion Neighbourhood 

There exists a neighbourhood structure (for equivalence classes of Bayesian
networks) which has a good theoretical basis. The inclusion of statistical
models defined by the networks, which corresponds to the inclusion of
conditional independence structures defined by the networks, induces a natural
inclusion ordering on the collection of equivalence classes. This ordering, in
turn, induces a neighbourhood concept. More specifically, two different types
of neighbouring equivalence
classes are assigned to every equivalence class of networks: the upper neighbours 
and lower neighbours. Thus, the inclusion neighbourhood, sometimes also named 
’inclusion boundary neighbourhood’ [7], consists of these two parts. There are 
also some practical reasons for using the inclusion neighbourhood - for details see 
[4]. Note that Chickering [5] has recently confirmed Meek’s conjecture [9] about 
transformational characterization of the inclusion ordering. A consequence of 
this result is a graphical description of the inclusion neighbourhood in terms of 
the collection of members of the considered equivalence class (see Sect. 2.4). 

The topic of this contribution is to characterize the inclusion neighbourhood
of a given equivalence class of Bayesian networks in terms of the respective
essential graph in such a way that it can be used efficiently in a method of local
search for maximization of a quality criterion. Two recent papers were devoted to
this problem, but, in the author’s view, neither of them brought a satisfactory
solution to the problem.

Chickering, in Sect. 5 of [5], gave a method which is able to generate tenta- 
tively all neighbouring equivalence classes (of a given equivalence class described 
by the respective essential graph). More specifically, two (composite) graphical 
operations applicable to an essential graph and respective legality tests which 
are able to decide whether the respective graphical operation leads to a
neighbouring equivalence class are designed in that paper. One of the operations
and the respective legality test are aimed at generating upper neighbours; the
other operation and test correspond to lower neighbours. Although the graphical
description of the inclusion neighbourhood in terms of individual networks from
Sect. 2.4 implies that every inclusion neighbour can be reached in this way, the
method has two drawbacks.

— The first drawback of the method is that it is tentative: different graphical 
operations may lead to the same equivalence class. Therefore, additional 
checking must be done to cure this imperfection. 

— The second drawback of this mechanistic approach is that it does not allow 
one to discern possible internal structure of the inclusion neighbourhood. 

Auvray and Wehenkel [2] made an attempt at direct characterization of the 
inclusion neighbourhood. Their characterization of the upper inclusion neigh- 
bourhood, that is, of those neighbouring equivalence classes which describe more 
independencies, removes the first drawback. They uniquely characterized and 
classified neighbouring equivalence classes of a given equivalence class (described 
in terms of the respective essential graph) by means of certain mathematical
objects. However, these objects are still unnecessarily complicated, which means
that their characterization of the upper inclusion neighbourhood is too awkward.
In particular, the second drawback is not removed by their approach since it
does not allow one to make out the internal structure of the inclusion
neighbourhood. Moreover, their characterization is incomplete: only a partial
direct characterization of lower neighbours is given there.

1.3 Compact Characterization of Inclusion Neighbours 

In this paper an elegant characterization of the inclusion neighbourhood of a
given equivalence class in terms of the respective essential graph is presented.
Each inclusion neighbour is uniquely described by a pair $([a, b], C)$ where
$[a, b]$ is an unordered pair of distinct nodes and
$C \subseteq N \setminus \{a, b\}$ is a disjoint set of nodes.
More specifically, $[a, b]$ is an edge of the essential graph in case of the upper
neighbourhood while $[a, b]$ is a pair of nodes which is not an edge in the
essential graph in case of the lower neighbourhood.

The first new observation made in this paper is that every inclusion neighbour
is uniquely characterized by a pair $([a, b], C)$ of this kind. The second
observation is that, for given $[a, b]$, the collection of those sets C which
correspond to inclusion neighbours has a special form.

In this contribution, a complete analysis of the upper inclusion
neighbourhood is given. In this case, the collection of sets C for a given edge
$[a, b]$ has the form of a tuft. This means that it is a collection of sets with
the least set (= the unique minimal set) and with possibly several maximal sets
such that every set which contains the least set and which is contained in one
of the maximal sets belongs to the collection. In particular, a tuft is
completely described by its least set and by the list of its maximal sets. Given
an essential graph G* and an edge $[a, b]$ in G*, the least and maximal sets of
the respective tuft of sets are characterized directly in terms of G*.

The structure of the lower inclusion neighbourhood is similar but more
complex. In that case, given a pair of nodes which is not an edge in the
essential graph, the respective collection of sets C is the union of (at most)
two tufts. The least and maximal sets of these two tufts can also be read from
the essential graph. However, because of the page limit, the case of the lower
neighbourhood will be completely analyzed in a future paper [15].

Note that the characterization of inclusion neighbours by means of pairs
$([a, b], C)$ where $C \subseteq N \setminus \{a, b\}$, and the way it is done in
this paper, is not incidental. An interesting fact is that, from a certain
perspective which is explained in detail in Chapter 8 of [13], the pair
$([a, b], C)$ is closely related to the conditional independence interpretation
of the ’move’ from the considered equivalence class to its respective inclusion
neighbour.

The proofs are omitted because of the strict 12-page limit for contributions
to the Proceedings of ECSQARU 2003. The reader interested in the proofs can
download an extended version of this paper at

http://www.utia.cas.cz/user_data/studeny/aalb03.html.

The proofs combine the ideas motivated by an arithmetic approach to the de- 
scription of Bayesian network models from [13] with certain graphical procedures 
which were already used in [2]. 

2 Basic Concepts 

2.1 Graphical Notions 

Graphs considered here have a finite non-empty set N as the set of nodes and
two possible types of edges. An undirected edge or a line over N is a subset of N
of cardinality two, that is, an unordered pair $\{a, b\}$ where $a, b \in N$,
$a \neq b$. The respective notation is $a - b$. A directed edge or an arrow over
N is an ordered pair $(a, b)$ where $a, b \in N$, $a \neq b$. The notation
$a \to b$ reflects its pictorial representation. A hybrid graph over N is a graph
without multiple edges, that is, a triplet
$H = (N, \mathcal{L}(H), \mathcal{A}(H))$ where N is a set of nodes,
$\mathcal{L}(H)$ a set of lines over N and $\mathcal{A}(H)$ a set of arrows over
N such that whenever $(a, b) \in \mathcal{A}(H)$ then
$(b, a) \notin \mathcal{A}(H)$ and $\{a, b\} = \{b, a\} \notin \mathcal{L}(H)$.
A pair $[a, b]$ of distinct elements of N will be called an edge in H (between a
and b) if one of the following cases occurs: $a - b$ in H, $a \to b$ in H or
$b \to a$ in H. If $\emptyset \neq A \subseteq N$ then the induced subgraph $H_A$
of H is the triplet
$(A, \mathcal{L}(H) \cap \mathcal{P}(A), \mathcal{A}(H) \cap (A \times A))$ where
$\mathcal{P}(A)$ denotes the power set of A (= the collection of subsets of A).

A set $K \subseteq N$ is complete in a hybrid graph H over N if for all
$a, b \in K$, $a \neq b$, one has $a - b$ in H. By a clique of H will be
understood a maximal complete set in H (with respect to set inclusion). The
collection of cliques of H will be denoted by cliques(H). A set $C \subseteq N$
is connected in H if, for every $a, b \in C$, there exists an undirected path
connecting them, that is, a sequence of distinct nodes
$a = c_1, \ldots, c_n = b$, $n \geq 1$, such that $c_i - c_{i+1}$ in H for
$i = 1, \ldots, n - 1$. Connectivity components of H are maximal connected sets
in H.







An undirected graph is a hybrid graph without arrows, that is,
$\mathcal{A}(H) = \emptyset$. A directed graph is a hybrid graph having arrows
only, that is, $\mathcal{L}(H) = \emptyset$. An acyclic directed graph is a
directed graph without directed cycles, that is, without any sequence
$d_1, \ldots, d_n, d_{n+1} = d_1$, $n \geq 3$, such that $d_1, \ldots, d_n$ are
distinct and $d_i \to d_{i+1}$ in H for $i = 1, \ldots, n$.

A chain graph is a hybrid graph H for which there exists a chain, that is, an
ordered partitioning of N into non-empty sets, called blocks,
$B_1, \ldots, B_m$, $m \geq 1$, such that

— if $a - b$ in H then $a, b \in B_i$ for some $1 \leq i \leq m$,

— if $a \to b$ in H then $a \in B_i$, $b \in B_j$ with $1 \leq i < j \leq m$.

An equivalent definition of a chain graph is that it is a hybrid graph H without
semi-directed cycles, that is, without any sequence
$d_1, \ldots, d_n, d_{n+1} = d_1$, $n \geq 3$, such that $d_1, \ldots, d_n$ are
distinct, $d_1 \to d_2$ in H and, for all $i = 2, \ldots, n$, either
$d_i \to d_{i+1}$ or $d_i - d_{i+1}$ in H - see Lemma 2.1 in [12]. Clearly, every
undirected graph and every acyclic directed graph is a chain graph. Moreover,
there is no arrow in a chain graph between nodes of a connected set
$C \subseteq N$; in other words, the induced subgraph $H_C$ is undirected. Thus,
the set of parents of C, that is,

$pa_H(C) = \{a \in N ;\ \exists\, b \in C \ \ a \to b \text{ in } H\}$

is disjoint with C. The set

$ne_H(C) = \{a \in N \setminus C ;\ \exists\, b \in C \ \ a - b \text{ in } H\}$

will be named the set of neighbours of C.
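The two set-valued operators can be written down almost verbatim. The following sketch assumes a hybrid graph is stored as a set of arrows (ordered pairs) and a set of lines (two-element frozensets); this representation, the function names and the small chain graph used as input are illustrative assumptions rather than part of the paper.

```python
def parents(arrows, C):
    """pa_H(C): nodes a with an arrow a -> b into some node b of C."""
    return {a for (a, b) in arrows if b in C}

def neighbours(lines, C):
    """ne_H(C): nodes outside C joined by a line to some node of C."""
    return {x for line in lines for x in line
            if x not in C and (line - {x}) & C}

# Hypothetical chain graph with blocks {a} and {b, d, e}.
arrows = {("a", "b"), ("a", "d")}
lines = {frozenset(("b", "d")), frozenset(("d", "e"))}

pa_bd = parents(arrows, {"b", "d"})      # parents of the connected set {b, d}
ne_bd = neighbours(lines, {"b", "d"})    # neighbours of the same set
```

For the connected set {b, d} this gives the parent set {a}, which is disjoint from the set itself as the text states, and the neighbour set {e}.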



2.2 Bayesian Networks and Their Equivalence 

A Bayesian network is a certain statistical model, that is, a class of
(multidimensional probability) distributions, appended to an acyclic directed
graph. It could be introduced as the class of distributions (on a fixed sample
space) which factorize according to the graph in a certain way. An alternative
definition of that class can be given in terms of conditional independence
restrictions, using the d-separation criterion from [10] or using the
moralization criterion from [8], which are known to be equivalent. Since exact
definitions of these concepts are not needed in this paper, they are omitted.
Nevertheless, given an acyclic directed graph G over N, the symbol
$\mathcal{I}(G)$ will be used to denote the collection of conditional
independence restrictions determined by G. Moreover, the phrase
“Bayesian network” will be used as a synonym for an acyclic directed graph
throughout the rest of the paper.

An important concept is the concept of equivalence of Bayesian networks.
Two Bayesian networks $G_1$ and $G_2$ are considered to be equivalent if they
represent the same statistical model, a requirement which is typically
equivalent to the condition $\mathcal{I}(G_1) = \mathcal{I}(G_2)$. Given an
equivalence class $\mathcal{G}$ of Bayesian networks over N, the symbol
$\mathcal{I}(\mathcal{G})$ will denote the shared collection of conditional
independence restrictions $\mathcal{I}(G)$ for $G \in \mathcal{G}$. Verma and
Pearl [16] gave a direct graphical characterization of equivalent Bayesian
networks which can be used as its formal definition here. The underlying graph
of a hybrid graph H over N is an undirected graph $H^u$ over N such that $a - b$
in $H^u$ iff $[a, b]$ is an edge in H. An immorality in H is a special induced
subgraph of H, namely the configuration $a \to c \leftarrow b$ where a, b, c are
distinct nodes and the pair $[a, b]$ is not an edge in H. Two Bayesian networks
$G_1, G_2$ over N are (graphically) equivalent iff they have the same underlying
graph and the same collection of immoralities. The equivalence characterization
makes the following definition consistent: given an equivalence class
$\mathcal{G}$ of Bayesian networks, a pair $[a, b]$ of distinct nodes is called
an edge in $\mathcal{G}$ if $[a, b]$ is an edge in some $G \in \mathcal{G}$,
which means it is an edge in every $G \in \mathcal{G}$.
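The graphical equivalence test can be sketched in a few lines. The code below assumes that both networks are given over the same node set as sets of arrows (ordered pairs with string labels); the function names and the three small test graphs are illustrative.

```python
from itertools import combinations

def skeleton(arrows):
    """Underlying graph: the set of unordered pairs joined by an arrow."""
    return {frozenset(e) for e in arrows}

def immoralities(arrows):
    """All configurations a -> c <- b with a and b non-adjacent."""
    skel = skeleton(arrows)
    parents = {}
    for a, b in arrows:
        parents.setdefault(b, set()).add(a)
    return {
        (frozenset((a, b)), c)
        for c, pa in parents.items()
        for a, b in combinations(sorted(pa), 2)
        if frozenset((a, b)) not in skel
    }

def equivalent(g1, g2):
    """Same underlying graph and the same collection of immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain = {("a", "b"), ("b", "c")}        # a -> b -> c
reversed_chain = {("c", "b"), ("b", "a")}  # a <- b <- c
collider = {("a", "b"), ("c", "b")}     # a -> b <- c
```

Here `equivalent(chain, reversed_chain)` holds because neither graph has an immorality, while `collider` shares the underlying graph but adds the immorality at b and therefore falls in a different class.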

2.3 Essential Graphs 

An equivalence class $\mathcal{G}$ of Bayesian networks (over N) can be
described by its essential graph, which is a hybrid graph G* (over N) such that

— $a \to b$ in G* if and only if $a \to b$ in G for every $G \in \mathcal{G}$,

— $a - b$ in G* if and only if there exist $G_1, G_2 \in \mathcal{G}$ such that
$a \to b$ in $G_1$ and $b \to a$ in $G_2$.

A graphical characterization of essential graphs was given by Andersson, Madi- 
gan and Perlman as Theorem 4.1 in [1]. Recently, a simpler alternative char- 
acterization has been found in [14] and, independently, in [11]. As complete 
characterization of essential graphs is not needed in this paper, it is omitted. 
However, what is needed is the following observation. It follows from Theorem
4.1 of [1] that every essential graph H (of an equivalence class of Bayesian
networks) is a chain graph without flags. Recall that by a flag in a hybrid
graph H is meant a special induced subgraph of H, namely the configuration
$a \to c - b$ where a, b, c are distinct nodes and the pair $[a, b]$ is not an
edge in H. Note that every chain graph without flags has the following property:
for every component C of H and $a, b \in C$ one has $pa_H(a) = pa_H(b)$; in
particular, $pa_H(a) = pa_H(C)$ for any $a \in C$.

Remark 1. Note that there are other possible ways of representing an equivalence
class $\mathcal{G}$ of Bayesian networks. One of them is the concept of the
largest chain graph [6] of the collection of chain graphs which are equivalent
to (any) $G \in \mathcal{G}$. Another alternative is brought by the arithmetic
approach presented in Sect. 8.4 of [13], which offers the concept of standard
imset.

2.4 Inclusion Ordering and Neighbourhood 

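The inclusion ordering on the set of equivalence classes of Bayesian networks
(over N) is defined by the binary relation
$\mathcal{I}(\mathcal{K}) \subseteq \mathcal{I}(\mathcal{L})$ for equivalence
classes $\mathcal{K}$ and $\mathcal{L}$. The symbol
$\mathcal{I}(\mathcal{K}) \subset \mathcal{I}(\mathcal{L})$ will denote the
strict ordering, that is, the situation when
$\mathcal{I}(\mathcal{K}) \subseteq \mathcal{I}(\mathcal{L})$ and
$\mathcal{I}(\mathcal{K}) \neq \mathcal{I}(\mathcal{L})$. Finally, the symbol
$\mathcal{I}(\mathcal{K}) \sqsubset \mathcal{I}(\mathcal{L})$ will mean that
$\mathcal{I}(\mathcal{K}) \subset \mathcal{I}(\mathcal{L})$ but there is no
equivalence class $\mathcal{G}$ of Bayesian networks (over N) such that
$\mathcal{I}(\mathcal{K}) \subset \mathcal{I}(\mathcal{G}) \subset \mathcal{I}(\mathcal{L})$.
If this is the case then $\mathcal{L}$ is called an upper neighbour of
$\mathcal{K}$ and $\mathcal{K}$ is called a lower neighbour of $\mathcal{L}$. By
the inclusion neighbourhood of an equivalence class is understood the collection
of its upper and lower neighbours.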

Transformational characterization of the inclusion ordering from [5] allows
one to derive a simple graphical description of the relation
$\mathcal{I}(\mathcal{K}) \sqsubset \mathcal{I}(\mathcal{L})$ as a
consequence - see Lemma 8.5 in [13].

Lemma 1. If $\mathcal{K}$ and $\mathcal{L}$ are equivalence classes of Bayesian
networks over N then one has
$\mathcal{I}(\mathcal{K}) \sqsubset \mathcal{I}(\mathcal{L})$ iff there exist
$K \in \mathcal{K}$ and $L \in \mathcal{L}$ such that L is made of K by the
removal of (exactly) one edge.



Remark 2. The relation
$\mathcal{I}(\mathcal{K}) \subseteq \mathcal{I}(\mathcal{L})$ corresponds to the
situation when the statistical model given by $K \in \mathcal{K}$ contains the
statistical model given by $L \in \mathcal{L}$. The networks in $\mathcal{K}$
then have more edges than the networks in $\mathcal{L}$. The reader may ask why
$\mathcal{L}$ is supposed to be ’above’ $\mathcal{K}$ in this paper (and not
conversely). The terminology used in this paper simply emphasizes the
conditional independence interpretation of the considered statistical models,
which is at the center of the author’s interests - for a more detailed
justification see Remark 8.10 in [13].

2.5 Tuft of Sets 

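Let $\mathcal{T}$ be a non-empty collection of subsets of N, that is,
$\emptyset \neq \mathcal{T} \subseteq \mathcal{P}(N)$, and let
$\mathcal{T}_{max}$ denote the collection of maximal sets in $\mathcal{T}$ (with
respect to set inclusion). The collection $\mathcal{T}$ will be called a tuft of
sets if

— $\mathcal{T}$ has the least set $T_{min}$, that is, $T_{min} \in \mathcal{T}$
with $T_{min} \subseteq T$ for $T \in \mathcal{T}$,

— every set $T \subseteq N$ such that $T_{min} \subseteq T \subseteq T'$ for some
$T' \in \mathcal{T}_{max}$ belongs to $\mathcal{T}$.

Thus, a tuft of sets $\mathcal{T}$ is determined by its unique least set
$T_{min}$ and by the class of its maximal sets $\mathcal{T}_{max}$.
Alternatively, it can be described by $T_{min}$ and the class
$\{T' \setminus T_{min} ;\ T' \in \mathcal{T}_{max}\}$. More specifically, assume
that $A \subseteq N$ and $\mathcal{B}$ is a non-empty class of incomparable
subsets of $N \setminus A$, that is, there are no $B, B' \in \mathcal{B}$ with
$B \subset B'$. Introduce a special notation:

$TUFT(A | \mathcal{B}) = \{T = A \cup C ;\ \exists\, B \in \mathcal{B} \ \ C \subseteq B\}$.

Evidently, $\mathcal{T} = TUFT(A | \mathcal{B})$ is a tuft such that
$\mathcal{T}_{max} = \{A \cup B ;\ B \in \mathcal{B}\}$ and $T_{min} = A$. Of
course, every tuft of subsets of N can be described in this way.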
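The TUFT notation translates directly into an enumeration. The sketch below is a brute-force illustration with hypothetical helper names `subsets` and `tuft`; it is evaluated on the instance A = {a}, B = {{b}, {c}, {d}} that Example 1 uses.

```python
from itertools import combinations

def subsets(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return (frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r))

def tuft(A, B):
    """TUFT(A|B): all sets A ∪ C where C is contained in some member of B."""
    A = frozenset(A)
    return {A | C for b in B for C in subsets(b)}

T = tuft({"a"}, [{"b"}, {"c"}, {"d"}])
least = min(T, key=len)                              # the least set T_min
maximal = {S for S in T if not any(S < U for U in T)}  # the class T_max
```

The least set of the result is A itself and the maximal sets are the unions of A with the members of B, in line with the remark that a tuft is determined by these two pieces of data.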

Example 1. Suppose $N = \{a, b, c, d\}$ and put $A = \{a\}$. Consider the class
$\mathcal{B} = \{\{b\}, \{c\}, \{d\}\}$, which is a class of incomparable subsets
of $N \setminus A$ (actually, the sets in $\mathcal{B}$ are disjoint). Then
$TUFT(A | \mathcal{B})$ consists of four sets: $\{a\}$, $\{a, b\}$, $\{a, c\}$
and $\{a, d\}$. The tuft is shown in Fig. 1.

3 Upper Inclusion Neighbourhood

3.1 Description of Upper Neighbours 

By the upper neighbourhood of an equivalence class $\mathcal{K}$ of Bayesian
networks is understood the collection $o^{\uparrow}(\mathcal{K})$ of equivalence
classes $\mathcal{L}$ such that
$\mathcal{I}(\mathcal{K}) \sqsubset \mathcal{I}(\mathcal{L})$. It follows from
Lemma 1 that each $K \in \mathcal{K}$ and each edge in K define together an
element of $o^{\uparrow}(\mathcal{K})$ and every element of
$o^{\uparrow}(\mathcal{K})$ is obtained in this way. Thus, the upper
neighbourhood $o^{\uparrow}(\mathcal{K})$ is, in fact, described in terms of
elements of $\mathcal{K}$. Nevertheless, the correspondence described above is
not a one-to-one mapping because different elements of $\mathcal{K}$ may yield
the same neighbouring class $\mathcal{L}$.

On the other hand, every neighbouring class is uniquely characterized by
a certain pair $([a, b], C)$ where $a, b \in N$, $a \neq b$ and
$C \subseteq N \setminus \{a, b\}$. The pair $([a, b], C)$ can be introduced in
graphical terms as follows.

Let $\mathcal{K}$ be an equivalence class of Bayesian networks over N,
$\mathcal{L} \in o^{\uparrow}(\mathcal{K})$. Choose $K \in \mathcal{K}$ and
$L \in \mathcal{L}$ such that L is obtained from K by the removal of an arrow
$a \to b$ in K. Then $\mathcal{L}$ will be described by the pair $([a, b], C)$
where $C = pa_K(b) \setminus \{a\}$.
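For a single network K the construction is a one-line computation. The sketch below assumes K is stored as a set of arrows; the function names and the small graph are hypothetical and serve only to show how the pair is read off from one member of the class.

```python
def parents(arrows, node):
    """pa_K(node): all nodes with an arrow into node."""
    return {a for (a, b) in arrows if b == node}

def upper_neighbour_pair(arrows, a, b):
    """Remove the arrow a -> b from K; return L and the pair ([a, b], C)."""
    if (a, b) not in arrows:
        raise ValueError("a -> b must be an arrow of K")
    C = parents(arrows, b) - {a}      # C = pa_K(b) \ {a}
    L = arrows - {(a, b)}             # removing an arrow keeps the graph acyclic
    return L, (frozenset((a, b)), frozenset(C))

# Hypothetical network K in which b has parents a, c and d.
K = {("a", "b"), ("c", "b"), ("d", "b")}
L, pair = upper_neighbour_pair(K, "a", "b")
```

In this instance the pair is ([a, b], {c, d}). Whether a different member of the same class yields the same pair for the same neighbouring class is exactly the consistency question addressed next.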

To show that the definition above is consistent one has to show that the pair 
([a, b],C) does not depend on the choice of K and L and that distinct pairs are 
ascribed to distinct upper neighbours. 

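Proposition 1. Let $\mathcal{K}$ be an equivalence class of Bayesian networks,
$\mathcal{L}_1, \mathcal{L}_2 \in o^{\uparrow}(\mathcal{K})$. Suppose, for
$i = 1, 2$, that graphs $K_i \in \mathcal{K}$ and $L_i \in \mathcal{L}_i$ are
given such that $L_i$ is made of $K_i$ by the removal of an arrow $a_i \to b_i$
in $K_i$ and $C_i = pa_{K_i}(b_i) \setminus \{a_i\}$. Then
$\mathcal{L}_1 = \mathcal{L}_2$ iff [$\{a_1, b_1\} = \{a_2, b_2\}$ and
$C_1 = C_2$].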

The proof of Proposition 1, which can be found in the extended version of
this paper, is based on a special arithmetic characterization of equivalence of
Bayesian networks. Note that one can perhaps also prove this result using purely
graphical tools, but the given proof is more elegant.

Example 2. To illustrate the concepts introduced above let us consider an
equivalence class $\mathcal{K}$ of Bayesian networks over
$N = \{a, b, c, d, e\}$ shown in the lower layer of Fig. 2. For every
$i = 1, 2, 3$, an acyclic directed graph $L_i$ is obtained from
$K_i \in \mathcal{K}$ by the removal of the arrow $a \to b$ (see the middle layer
of the figure). In this example, each of $K_i$, $i = 1, 2, 3$, establishes a
different neighbouring class $\mathcal{L}_i \in o^{\uparrow}(\mathcal{K})$ - the
respective essential graphs are in the upper layer of Fig. 2.
Since $pa_{K_2}(b) \setminus \{a\} = \{c, d\}$, the equivalence class containing
$L_2$ is characterized by the pair $([a, b], \{c, d\})$.



Remark 3. The pair $([a, b], C)$ describing uniquely an upper neighbour
$\mathcal{L} \in o^{\uparrow}(\mathcal{K})$ was introduced in terms of
individual networks from $\mathcal{K}$ and $\mathcal{L}$. If $\mathcal{K}$ and
$\mathcal{L}$ are represented by the respective essential graphs K* and L* then
$[a, b]$ is simply the edge of K* which is not an edge in L*. The question of
how to define C in terms of K* and L* has not been examined so far by the
author. On the other hand, the pair $([a, b], C)$ is obtained immediately if
$\mathcal{K}$ and $\mathcal{L}$ are represented by means of their standard
imsets - see [13].

3.2 Characterization of Upper Neighbourhood

Given an equivalence class $\mathcal{K}$, the next step is to characterize those
pairs $([a, b], C)$ which define elements
$\mathcal{L} \in o^{\uparrow}(\mathcal{K})$. In this section, this question is
answered for a fixed unordered pair of distinct nodes $[a, b]$. For this
purpose, put

$\mathcal{C}^{+}_{\mathcal{K}}(a \to b) = \{C ;\ \exists\, K \in \mathcal{K} \text{ such that } a \to b \text{ in } K \text{ and } C = pa_K(b) \setminus \{a\}\}$

for every ordered pair of distinct nodes $(a, b)$. It follows from Sect. 3.1
that $\mathcal{C}^{+}_{\mathcal{K}}(a \to b) \cup \mathcal{C}^{+}_{\mathcal{K}}(b \to a)$
is the class of sets which has to be characterized. Therefore, given $(a, b)$,
one needs to find out when $\mathcal{C}^{+}_{\mathcal{K}}(a \to b)$ is non-empty
and describe that collection if it is non-empty.

Proposition 2. Let $\mathcal{K}$ be an equivalence class of Bayesian networks,
K* the essential graph of $\mathcal{K}$ and $(a, b)$ an ordered pair of distinct
nodes of K*. Put $P = pa_{K^*}(b) \setminus \{a\}$ and
$M = \{c \in ne_{K^*}(b) \setminus \{a\} ;\ [a, c] \text{ is an edge in } K^*\}$.
Then

(i) $\mathcal{C}^{+}_{\mathcal{K}}(a \to b) \neq \emptyset$ iff $[a, b]$ is an
edge in K* and $b \notin pa_{K^*}(a)$, that is, either $a \to b$ in K* or
$a - b$ in K*.

(ii) If $a \to b$ in K* or $a - b$ in K* then
$\mathcal{C}^{+}_{\mathcal{K}}(a \to b) = TUFT(P | cliques(K^*_M))$, where
$cliques(K^*_{\emptyset}) = \{\emptyset\}$ by convention.

The proof is given in the extended version of this paper. 

Corollary 1. Let $\mathcal{K}$ be an equivalence class of Bayesian networks over
N, K* the essential graph of $\mathcal{K}$ and $[a, b]$ an edge in K*. Then the
collection of those sets $C \subseteq N \setminus \{a, b\}$ such that
$([a, b], C)$ describes an upper neighbour
$\mathcal{L} \in o^{\uparrow}(\mathcal{K})$ is a tuft $TUFT(P | cliques(K^*_M))$
where

(a) if $a \to b$ in K* then $P = pa_{K^*}(b) \setminus \{a\}$ and
$M = ne_{K^*}(b)$,

(b) if $a \leftarrow b$ in K* then $P = pa_{K^*}(a) \setminus \{b\}$ and
$M = ne_{K^*}(a)$,

(c) if $a - b$ in K* then $M = \{c \in N ;\ a - c - b \text{ in } K^*\}$ and
$P = pa_{K^*}(b) = pa_{K^*}(a)$.



Proof. Recall that one needs to characterize
$\mathcal{C}^{+}_{\mathcal{K}}(a \to b) \cup \mathcal{C}^{+}_{\mathcal{K}}(b \to a)$
using Proposition 2. In case (a) observe that
$\mathcal{C}^{+}_{\mathcal{K}}(b \to a) = \emptyset$. As concerns
$\mathcal{C}^{+}_{\mathcal{K}}(a \to b)$, the fact that K* has no flags implies
$M = ne_{K^*}(b)$. The case (b) is symmetric. In case (c) realize that the fact
that K* has no flags implies $pa_{K^*}(b) = pa_{K^*}(a)$. This observation and
the fact that K* is a chain graph allow one to show that
$\mathcal{C}^{+}_{\mathcal{K}}(a \to b) = \mathcal{C}^{+}_{\mathcal{K}}(b \to a) \neq \emptyset$.

Example 3. To illustrate the previous result consider the essential graphs shown
in Fig. 3. The case (a) from Corollary 1 occurs for the graph K* in the
left-hand picture of the figure. More specifically, one has $P = \{c\}$,
$M = \{d, e\}$ and $cliques(K^*_M) = \{\{d\}, \{e\}\}$. Thus, the class of sets
$C \subseteq N \setminus \{a, b\}$ such that $([a, b], C)$ describes an upper
neighbour of the respective equivalence class is $TUFT(\{c\} | \{d\}, \{e\})$,
that is, the class which involves three sets: $\{c\}$, $\{c, d\}$ and
$\{c, e\}$. Indeed, it was shown in Example 2 that those upper neighbours of the
respective equivalence class which “correspond” to the “removal” of $a \to b$
are characterized by pairs $([a, b], \{c\})$, $([a, b], \{c, d\})$ and
$([a, b], \{c, e\})$.

Fig. 3. Two essential graphs

If the graph G* in the right-hand picture of Fig. 3 is considered then the case
(c) from Corollary 1 occurs. One has $P = \emptyset$, $M = \{c, d, e\}$ and
$cliques(G^*_M) = \{\{c, d\}, \{d, e\}\}$. The respective class
$TUFT(\emptyset | \{c, d\}, \{d, e\})$ has six sets: $\emptyset$, $\{c\}$,
$\{d\}$, $\{e\}$, $\{c, d\}$ and $\{d, e\}$.
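Case (c) of Corollary 1 can be traced step by step on a concrete undirected graph. Since Fig. 3 is not reproduced here, the edge list below is an assumption: it is one undirected graph over {a, b, c, d, e} that is consistent with the values of P, M and the cliques stated for the right-hand picture. The helpers use brute-force enumeration and are illustrative only.

```python
from itertools import combinations

def subsets(s):
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def is_complete(lines, S):
    """True when every two distinct nodes of S are joined by a line."""
    return all(frozenset(p) in lines for p in combinations(sorted(S), 2))

def cliques(lines, M):
    """Maximal complete subsets of M; yields {∅} for empty M (the convention)."""
    complete = [S for S in subsets(M) if is_complete(lines, S)]
    return {S for S in complete if not any(S < T for T in complete)}

def tuft(A, B):
    A = frozenset(A)
    return {A | C for b in B for C in subsets(b)}

# Assumed undirected graph: a and b adjacent to everything, plus the path c - d - e.
nodes = set("abcde")
lines = {frozenset(p) for p in [
    ("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
    ("b", "c"), ("b", "d"), ("b", "e"), ("c", "d"), ("d", "e")]}

# Case (c): M = common neighbours of a and b; P is empty as there are no arrows.
M = {c for c in nodes - {"a", "b"}
     if frozenset(("a", c)) in lines and frozenset((c, "b")) in lines}
P = set()
maximal_sets = cliques(lines, M)
result = tuft(P, maximal_sets)
```

For this assumed graph the enumeration returns the cliques {c, d} and {d, e} and a tuft with the six members listed in the example; {c, e} is absent because c and e are not adjacent.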



4 Conclusions 

In this contribution a characterization of the upper inclusion neighbourhood was
presented. Analogous results for the lower inclusion neighbourhood have also
been achieved and they will be presented in a later paper [15]. Note that lower
neighbours of a given equivalence class of Bayesian networks over N can also be
described uniquely by pairs $([a, b], C)$ where $a, b \in N$, $a \neq b$ and
$C \subseteq N \setminus \{a, b\}$, and that, given $[a, b]$, the class of sets
C which correspond to lower neighbours is the union of two tufts which can be
characterized in terms of the respective essential graph. An important fact is
that if $\mathcal{I}(\mathcal{K}) \sqsubset \mathcal{I}(\mathcal{L})$ then the
pair $([a, b], C)$ describing $\mathcal{L}$ as one of the upper neighbours of
$\mathcal{K}$ coincides with the pair $([a, b], C)$ which describes
$\mathcal{K}$ as one of the lower neighbours of $\mathcal{L}$. Thus, both
characterizations are internally consistent and the pair $([a, b], C)$ can be
viewed as a natural characteristic of the ’move’ between $\mathcal{K}$ and
$\mathcal{L}$.

As explained in Sect. 1 the presented characterization is more elegant than
the previous ones. Indeed, Chickering [5] only gave a tentative algorithmic
method and Auvray and Wehenkel [2] characterized every inclusion neighbour by
an unordered pair of nodes and by an opaque collection of immoralities, namely
those which are either created or cancelled if an equivalence class is replaced
by its inclusion neighbour. Finally, the characterization presented in this paper
has a close connection to an arithmetic method of description of equivalence
classes of Bayesian networks (from Chapter 8 of [13]) and leads to a conditional
independence interpretation of 'moves' in the method of local search.
Abstract. Mixtures of truncated exponential (MTE) distributions have
been shown to be a powerful alternative to discretisation within the
framework of Bayesian networks. One of the features of the MTE model
is that standard propagation algorithms such as Shenoy-Shafer and Lazy
propagation can be used. Estimating conditional MTE densities from data is
a rather difficult problem since, as far as we know, such densities cannot 
be expressed in parametric form in the general case. In the univariate 
case, regression-based estimators have been successfully employed. In 
this paper, we propose a method to estimate conditional MTE densities 
using mixed trees, which are graphical structures similar to classification 
trees. Criteria for selecting the variables during the construction of the 
tree and for pruning the leaves are defined in terms of the mean square 
error and entropy-like measures. 



1 Introduction 

Bayesian networks have been widely employed to reason under uncertainty in 
systems involving many variables where the uncertainty is represented in terms of 
a multivariate probability distribution. The structure of the network encodes the 
independence relationships amongst the variables, so that the network actually
induces a factorisation of the joint distribution which allows the definition of
efficient algorithms for probabilistic inference [2,5,6,10,13].

The algorithms mentioned above are, in principle, designed for discrete vari- 
ables. However, in many practical applications, it is common to find problems in 
which discrete and continuous variables appear simultaneously. Some methods
address this kind of problem for special models such as the conditional
Gaussian distribution [4,9], but the most general solution has been to discretise the
continuous variables and then proceed using algorithms for discrete variables [3].

Recently, an alternative to discretisation has been proposed, based on the 
use of MTE distributions. The usefulness of MTE models can be understood 
if we realise that discretising a variable or a set of variables can be regarded
as approximating its distribution by a mixture of uniforms, and that standard 
propagation algorithms in fact are able to work with mixtures of uniforms. If we 
could use, instead of uniforms, other distributions with higher fitting power and 
verifying that standard propagation algorithms remain valid, then discretisation 
would not necessarily be the best choice. This is the motivation of MTE models. 
Propagation in Bayesian networks with MTE distributions was shown to be 
correct for the Shenoy-Shafer architecture and for MCMC simulation algorithms
in [7], and a method for estimating univariate MTE distributions from data was
proposed in [8].

However, a complete specification of a Bayesian network requires not only 
univariate distributions for the root nodes but conditional distributions for the 
others as well. In this paper, we propose a method to estimate conditional MTE 
densities using mixed trees [7], which are graphical structures similar to classi- 
fication trees [11]. 

This article continues with a description of the MTE model in Sect. 2. The 
representation based on mixed trees can be found in Sect. 3. Section 4 contains 
the formulation of the method proposed here for constructing mixed trees, which 
is illustrated with some experiments reported in Sect. 5. The paper ends with 
conclusions in Sect. 6. 

2 The MTE Model 

Throughout this paper, random variables will be denoted by capital letters, 
and their values by lowercase letters. In the multi-dimensional case, boldfaced 
characters will be used. The domain of the variable X is denoted by $\Omega_X$. The
MTE model is defined by its corresponding potential and density as follows [7]: 

Definition 1. (MTE potential) Let X be a mixed n-dimensional random variable.
Let $Y = (Y_1,\ldots,Y_d)$ and $Z = (Z_1,\ldots,Z_c)$ be the discrete and continuous
parts of X respectively, with $c + d = n$. We say that a function $f : \Omega_X \to \mathbb{R}_0^+$
is a mixture of truncated exponentials potential (MTE potential) if one of the
next two conditions holds:

i. f can be written as

$$f(x) = f(y,z) = a_0 + \sum_{i=1}^{m} a_i \exp\left\{ \sum_{j=1}^{d} b_i^{(j)} y_j + \sum_{k=1}^{c} b_i^{(d+k)} z_k \right\} \qquad (1)$$

for all $x \in \Omega_X$, where $a_i$, $i = 0,\ldots,m$ and $b_i^{(j)}$, $i = 1,\ldots,m$, $j = 1,\ldots,n$
are real numbers.

ii. There is a partition $\Omega_1,\ldots,\Omega_k$ of $\Omega_X$ verifying that the domain of the
continuous variables, $\Omega_Z$, is divided into hypercubes and such that f is defined
as

$$f(x) = f_i(x) \quad \text{if } x \in \Omega_i ,$$

where each $f_i$, $i = 1,\ldots,k$ can be written in the form of equation (1).
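
To make Definition 1 concrete, the following sketch evaluates a potential of the form (1) on a partition into hypercubes. It is only an illustration under simplifying assumptions: it covers continuous variables only, the hypercubes are taken to be half-open boxes, and the names `mte_term`, `mte_potential` and the coefficients in `pieces` are hypothetical, not taken from the paper.

```python
import math

def mte_term(a0, terms, point):
    """Evaluate a0 + sum_i a_i * exp(b_i . point).

    `terms` is a list of (a_i, b_i) pairs, b_i being a coefficient tuple.
    """
    return a0 + sum(a * math.exp(sum(b * v for b, v in zip(bs, point)))
                    for a, bs in terms)

def mte_potential(pieces, point):
    """Evaluate a piecewise MTE potential.

    `pieces` is a list of (box, a0, terms); `box` is a list of (low, high)
    intervals, one per variable, treated as half-open (an assumption).
    """
    for box, a0, terms in pieces:
        if all(low <= v < high for (low, high), v in zip(box, point)):
            return mte_term(a0, terms, point)
    raise ValueError("point lies outside every hypercube of the partition")

# Hypothetical two-piece potential over (z1, z2), for illustration only.
pieces = [
    ([(0.0, 1.0), (0.0, 2.0)], 1.0, [(1.0, (1.0, 1.0))]),  # 1 + exp(z1 + z2)
    ([(1.0, 2.0), (0.0, 2.0)], 0.5, []),                   # constant 0.5
]
```

Each entry of `pieces` plays the role of one $f_i$ in condition ii, and `mte_term` is a direct transcription of equation (1) restricted to the continuous part.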
Example 1. The function $\varphi$ defined as

$$
\varphi(z_1,z_2) =
\begin{cases}
2 + e^{3z_1+z_2} + e^{z_1+z_2} & \text{if } 0 < z_1 \le 1,\ 0 < z_2 < 2 \\
1 + e^{z_1+z_2} & \text{if } 0 < z_1 \le 1,\ 2 \le z_2 < 3 \\
\tfrac{1}{4} + e^{2z_1+z_2} & \text{if } 1 < z_1 < 2,\ 0 < z_2 < 2 \\
\tfrac{1}{2} + 5e^{z_1+2z_2} & \text{if } 1 < z_1 < 2,\ 2 \le z_2 < 3
\end{cases}
$$

is an MTE potential since each of its parts is an MTE potential.



Definition 2. (MTE density) An MTE potential f is an MTE density if

$$\sum_{y \in \Omega_Y} \int_{\Omega_Z} f(y,z)\, dz = 1 .$$


In a Bayesian network, we find two types of densities: 

1. For each variable X which is a root of the network, a density $f(x)$ is given.

2. For each variable X with parents Y, a conditional density $f(x \mid y)$ is given.

A conditional MTE density $f(x \mid y)$ is an MTE potential $f(x,y)$ such that
fixing y to each of its possible values, the resulting function is a density for X.

3 Mixed Trees 

In [7] a data structure was proposed to represent MTE potentials: the so-called
mixed probability trees, or mixed trees for short. The formal definition is as follows:

Definition 3. (Mixed tree) We say that a tree T is a mixed tree if it meets the 
following conditions: 

i. Every internal node represents a random variable (either discrete or
continuous).

ii. Every arc outgoing from a continuous variable Z is labeled with an interval
of values of Z, so that the domain of Z is the union of the intervals
corresponding to the arcs outgoing from Z.

iii. Every discrete variable has a number of outgoing arcs equal to its number of 
states. 

iv. Each leaf node contains an MTE potential defined on variables in the path 
from the root to that leaf. 

Mixed trees can represent MTE potentials defined by parts. Each entire 
branch in the tree determines one sub-region of the space where the poten- 
tial is defined, and the function stored in the leaf of a branch is the definition of 
the potential in the corresponding sub-region. 
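
A mixed tree in the sense of Definition 3 can be sketched as a small recursive data structure. The class below is an illustrative assumption, not the representation used in [7]: the field names, the half-open intervals and the toy tree are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class MixedTree:
    variable: Optional[str] = None   # None marks a leaf
    potential: Optional[Callable[[Dict[str, float]], float]] = None  # leaves only
    # Continuous variable: list of ((low, high), subtree), one per outgoing arc.
    intervals: List[Tuple[Tuple[float, float], "MixedTree"]] = field(default_factory=list)
    # Discrete variable: one subtree per state.
    states: Dict[int, "MixedTree"] = field(default_factory=dict)

    def evaluate(self, point: Dict[str, float]) -> float:
        """Follow the branch selected by `point` and evaluate the leaf potential."""
        if self.variable is None:
            return self.potential(point)
        value = point[self.variable]
        if self.states:
            return self.states[value].evaluate(point)
        for (low, high), child in self.intervals:
            if low <= value < high:  # half-open intervals are an assumption
                return child.evaluate(point)
        raise ValueError(f"{self.variable}={value} is outside the labeled intervals")

# Hypothetical tree: discrete root y1, continuous z1 below the branch y1 = 0.
tree = MixedTree(variable="y1", states={
    0: MixedTree(variable="z1", intervals=[
        ((0.0, 1.0), MixedTree(potential=lambda p: 1.0 + math.exp(p["z1"]))),
        ((1.0, 2.0), MixedTree(potential=lambda p: 0.25)),
    ]),
    1: MixedTree(potential=lambda p: 0.5),
})
```

Calling `tree.evaluate({"y1": 0, "z1": 0.0})` walks the branch y1 = 0, z1 in [0, 1) and evaluates the function stored in that leaf, which mirrors how each branch determines one sub-region of the potential.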
Fig. 1. A mixed probability tree representing the potential $\phi$ in Example 2

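Example 2. Consider the following MTE potential, defined for a discrete variable
($Y_1$) and two continuous variables ($Z_1$ and $Z_2$).

$$
\phi(y_1,z_1,z_2) =
\begin{cases}
2 + e^{3z_1+z_2} & \text{if } y_1 = 0,\ 0 < z_1 \le 1,\ 0 < z_2 < 2 \\
1 + e^{z_1+z_2} & \text{if } y_1 = 0,\ 0 < z_1 \le 1,\ 2 \le z_2 < 3 \\
\tfrac{1}{4} + e^{2z_1+z_2} & \text{if } y_1 = 0,\ 1 < z_1 < 2,\ 0 < z_2 < 2 \\
\tfrac{1}{2} + 5e^{z_1+2z_2} & \text{if } y_1 = 0,\ 1 < z_1 < 2,\ 2 \le z_2 < 3 \\
1 + 2e^{2z_1+z_2} & \text{if } y_1 = 1,\ 0 < z_1 \le 1,\ 0 < z_2 < 2 \\
1 + 2e^{z_1+z_2} & \text{if } y_1 = 1,\ 0 < z_1 \le 1,\ 2 \le z_2 < 3 \\
\tfrac{1}{3} & \text{if } y_1 = 1,\ 1 < z_1 < 2,\ 0 < z_2 < 2 \\
\tfrac{1}{2} & \text{if } y_1 = 1,\ 1 < z_1 < 2,\ 2 \le z_2 < 3
\end{cases}
$$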

A possible representation of this potential by means of a mixed probability tree 
is displayed in Fig. 1. 

The operations required for probability propagation in Bayesian networks 
(restriction, marginalisation and combination) can be carried out by means of 
algorithms very similar to those described, for instance in [3,12]. 

4 Constructing Mixed Trees from Data 

The aim of this paper is to describe a method to construct, from a database, a
conditional MTE density $f(x \mid y)$ for each family in the network, where X denotes
the child variable and Y its parents. A first approach to estimating conditional
MTE densities from data could be the application of a standard estimation
procedure such as maximum likelihood (ML). However, even in the
case of univariate MTE densities there is no way to solve the ML equations
exactly. The method proposed in [8] to construct estimators for the parameters
of the univariate MTE density is not valid for the conditional case, since more
restrictions should be imposed on the parameters in order to force the MTE
potential to integrate to 1 for each combination of values of the conditioning
variables, i.e. to force the MTE potential to actually be a conditional density.

Our proposal consists of partitioning the domain of the conditioning variables
Y and then fitting a univariate density $f(x)$ in each one of the splits using the
method described in [8]. Obviously, the accuracy of the estimated density would
strongly depend on the number of splits: the higher this number is the better the 
fitting power becomes. Nevertheless, another factor that always determines the 
goodness of fit is the sample size. If the number of splits is too high, the subset 
of the sample that would be used to estimate the density in each region may be 
too small or even of size zero. This argument must be taken into account when 
deciding where to partition the domain. 

The process of splitting the domain of the variables in Y can be seen as
constructing a mixed tree where each internal node is a variable $Y_i \in Y$ and
finally each leaf will contain an MTE density for the variable X in the split
determined by the branch of the tree that leads to it. In order to design an
algorithm to construct the tree, the following issues must be addressed:

1. Selection of the variable to expand. Similarly to the case of probability trees 
[12], it can help to construct smaller trees without loss of accuracy. 

2. Determination of the splits of the selected variable. The ideal is to select 
the cut-points of the domain in such a way that the data best fits to an 
MTE density in each split. Trying to evaluate this is not an easy task and, 
furthermore, may be too costly, since we think that an optimal partitioning 
strategy should consider aspects like the remaining sample size in each part 
and the accuracy of the models fitted with respect to different partitions, in 
order to choose the best. This can be regarded as an optimisation problem 
in which each movement through the search space requires the estimation 
of new MTE densities and the evaluation of their accuracy. That is why 
we have decided to partition the domain into equal-width intervals, the
number of splits being a parameter given by the user according to the available
resources.

3. Definition of a criterion to stop branching the tree. Again, following the
guidelines in [12], we propose to keep expanding the tree as long as the number
of leaves does not surpass a threshold which is given by the user, again taking
into account the available resources.

4. Pruning the tree. Once the tree has been constructed, it is useful to have 
a method to reduce its size by pruning branches which are equal or very 
similar. In fact this is not very useful during the learning stage, but when the 
learnt network is used for probability propagation, the combination operation 
can produce very large trees, which might lead the computer to run out of 
memory. 

In the next section we shall study some of the issues above in more detail. 
4.1 Selection of the Variable to Expand



The initial step in the construction of the tree is to build a tree
consisting of only a leaf, which is an MTE density fitted to the entire database,
i.e. the conditional distribution $f(x \mid y)$ is considered to be the same for every
value y. Then, the variable selected to expand the tree is the one which divides
the domain of Y into the most dissimilar sub-regions, where the difference
among the sub-regions is understood as the goodness of fit of the current
conditional density $f(x \mid y)$ in each of them. The idea behind this criterion is to
avoid expanding the tree when no gain in accuracy is achieved.

In order to determine how different the goodness of fit among the different 
splits is, we use the following measure, called splitting gain. 



Definition 4. (Splitting gain) Let $f(x)$ be the MTE density for the target variable
X in a leaf of a mixed tree which is to be expanded. Let $W \in Y$ be a
variable not already expanded in the current branch of the tree, and $D_1,\ldots,D_j$
the splits into which the domain of Y would be partitioned if we expanded by
W. Let $e_1,\ldots,e_j$ denote the normalised mean squared errors between $f(x)$ and
the empirical histogram of variable X in $D_1,\ldots,D_j$ respectively. We define the
splitting gain of variable W in a function $f(x)$ as

$$SG(f,W) = \sum_{i=1}^{j} e_i \log e_i . \qquad (2)$$

This measure, which is similar to the entropy of a probability distribution,
is equal to $-\log j$ if the normalised errors are uniform, and equal to 0 if all the
normalised errors are 0 but one, which is equal to 1. Our proposal is to expand by
the variable maximising the splitting gain. If the gain is very low, it means that
the error in the different splits is the same, i.e., a single MTE density fits the data
in the different splits with the same accuracy. It suggests that there is no need
to split that variable, since the fitted models in each region would be the same.
On the contrary, a high value of SG (close to zero) means that the function fits
badly in at least one of the splits, while in the others the performance is good.
In this case it is worth expanding that variable, since the new functions will be
more accurate.
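
The behaviour of the splitting gain at its two extremes can be checked with a few lines of code. The sketch below assumes that "normalised" means the errors are rescaled to sum to one, uses the convention 0 log 0 = 0, and treats the degenerate case in which every error is zero as uniform; these conventions and the function name `splitting_gain` are assumptions made for illustration.

```python
import math

def splitting_gain(errors):
    """Sum of e_i * log(e_i) over the errors rescaled to sum to one."""
    total = sum(errors)
    if total == 0:
        # Assumption: identical (all-zero) errors are treated as uniform.
        return -math.log(len(errors))
    gain = 0.0
    for e in errors:
        p = e / total
        if p > 0:  # convention: 0 * log 0 = 0
            gain += p * math.log(p)
    return gain

uniform = splitting_gain([2.0, 2.0, 2.0, 2.0])  # equals -log 4
peaked = splitting_gain([0.0, 0.0, 3.0])        # equals 0
```

With equal errors the gain reaches its minimum $-\log j$, so the variable is not worth expanding; when the whole error concentrates in one split the gain is 0, its maximum.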



4.2 The Learning Algorithm 



In order to sum up what we have presented in this section, we describe the 
algorithm to construct a mixed tree with at most M leaves for the conditional 
density of a variable X given its parents Y, from a database D, and splitting 
the domain of each variable into j parts. 
LEARN_MIXED_TREE(D, X, Y, M, j)

1. Fit an MTE density $f(x)$ for variable X to the data in D using the method
described in [8].

2. If the number of leaves in the tree is lower than M,

a) Choose the variable $W \in Y$ such that

$$W = \arg\max_{W' \in Y} SG(f, W') .$$

b) Split the domain of W into j parts.

c) Split D into j parts $D_1,\ldots,D_j$ according to the partition of W.

d) For i = 1 to j

LEARN_MIXED_TREE($D_i$, X, Y \ {W}, M, j).
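
The recursion above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the univariate estimator of [8] is replaced by user-supplied `fit` and `error` callables, all parents are assumed continuous with known domains, a branch is expanded only if the new leaves keep the total within the limit, and every name in the block is hypothetical.

```python
import math

def splitting_gain(errors):
    """Sum of e_i * log(e_i) over the errors rescaled to sum to one."""
    total = sum(errors)
    if total == 0:  # assumption: all-zero errors are treated as uniform
        return -math.log(len(errors))
    return sum((e / total) * math.log(e / total) for e in errors if e > 0)

def split_rows(rows, var, low, high, j):
    """Distribute rows over j equal-width intervals of [low, high]."""
    parts = [[] for _ in range(j)]
    width = (high - low) / j
    for row in rows:
        index = int((row[var] - low) / width)
        parts[min(max(index, 0), j - 1)].append(row)
    return parts

def learn_mixed_tree(rows, parents, domains, max_leaves, j, fit, error, counter=None):
    counter = [1] if counter is None else counter  # leaves created so far
    node = {"density": fit(rows), "variable": None, "children": []}
    # Assumption: expand only if the j new leaves keep the total within max_leaves.
    if not parents or counter[0] + j - 1 > max_leaves:
        return node

    def gain(var):
        parts = split_rows(rows, var, *domains[var], j)
        return splitting_gain([error(node["density"], part) for part in parts])

    best = max(parents, key=gain)
    node["variable"] = best
    counter[0] += j - 1
    remaining = [v for v in parents if v != best]
    for part in split_rows(rows, best, *domains[best], j):
        node["children"].append(
            learn_mixed_tree(part, remaining, domains, max_leaves, j, fit, error, counter))
    return node

# Toy stand-ins (not the estimator of [8]): a "density" is just the sample mean of x.
def fit_mean(rows):
    return sum(r["x"] for r in rows) / len(rows) if rows else 0.0

def squared_error(mean, rows):
    return sum((r["x"] - mean) ** 2 for r in rows) / len(rows) if rows else 0.0

rows = [{"y1": y1, "y2": y2, "x": 1.0 if y1 < 0.5 else x_high}
        for y1 in (0.25, 0.75) for y2 in (0.25, 0.75) for x_high in (0.0, 4.0)]
tree = learn_mixed_tree(rows, ["y1", "y2"], {"y1": (0.0, 1.0), "y2": (0.0, 1.0)},
                        max_leaves=2, j=2, fit=fit_mean, error=squared_error)
```

In the toy data the fit of the root "density" differs strongly between the two halves of y1 but is identical across the halves of y2, so the sketch expands y1, as the splitting gain criterion prescribes.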



4.3 Pruning the Tree 

Pruning a mixed tree consists of choosing a variable whose children are leaves
(i.e. functions) and replacing it and its children by a single function which is
fitted again to the resulting joined domain. We define the error of pruning in the
following way:

Definition 5. (Error of pruning) Let W be a variable in a mixed tree with children
$f_1(x),\ldots,f_j(x)$. Let $D_W$ be the subset of the database corresponding to
values compatible with the branch that leads to W, and $D_{W_i}$, $i = 1,\ldots,j$ the
partition of $D_W$ induced by the split of W. Let $e_i$, $i = 1,\ldots,j$ be the mean
squared errors of $f_i$ with respect to the empirical histograms for the variable X
in $D_{W_i}$, $i = 1,\ldots,j$. Let $f(x)$ be an MTE density fitted to the data in $D_W$ and
$e$ the mean squared error for $f(x)$ in $D_W$. We define the error of pruning the
variable W as

$$EP(W) = e - \sum_{i=1}^{j} e_i . \qquad (3)$$

There are two ways of carrying out the pruning: 

1. Sequentially select a variable W minimising EP(W) and prune it until the 
size of the tree (its number of leaves) is reduced to a given threshold (specified 
by the user). 

2. Prune every variable W for which $EP(W) < \epsilon$, where $\epsilon > 0$ is an error
threshold given by the user.
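
Both pruning strategies only need the quantity in equation (3). A minimal sketch, assuming the errors have already been computed and using hypothetical names and numbers, is:

```python
def error_of_pruning(merged_error, child_errors):
    """EP(W) = e - sum_i e_i, with e the error of the single refitted density."""
    return merged_error - sum(child_errors)

def prunable(candidates, threshold):
    """Second strategy: names of the variables whose EP falls below the threshold.

    `candidates` maps a variable name to (merged_error, child_errors).
    """
    return [name for name, (e, child_errors) in candidates.items()
            if error_of_pruning(e, child_errors) < threshold]

# Hypothetical errors for two variables whose children are all leaves.
candidates = {"W1": (0.5, [0.25, 0.125]), "W2": (2.0, [0.5, 0.5])}
selected = prunable(candidates, threshold=0.25)
```

The first strategy would instead repeatedly pick the candidate with the smallest `error_of_pruning` value until the tree is small enough.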

5 Experiments 



In order to illustrate how the algorithm works we shall describe in this section 
the results of fitting a mixed tree to a conditional normal distribution. The 
experiment was conducted as follows. 
Fig. 2. The density function of a $\mathcal{N}(0.5y, \sqrt{0.75})$ distribution



We consider a pair of random variables X and Y following a bivariate normal
distribution with mean vector $\mu = (0,0)$ and covariance matrix

$$\mathrm{Cov}(X,Y) = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} ,$$

which means that the covariance is $\sigma_{XY} = 0.5$, the marginal of Y is $\mathcal{N}(0,1)$ and
the conditional distribution of X given Y = y is $\mathcal{N}(0.5y, \sqrt{0.75})$, where the
second parameter is the standard deviation. The conditional
density corresponding to this distribution is displayed in Fig. 2.

Then we generated a sample of 500 values of X from the conditional
distribution. In order to simulate a value for X, first a value for Y is drawn from
the $\mathcal{N}(0,1)$ distribution and then a value for X is drawn from $\mathcal{N}(0.5y, \sqrt{0.75})$,
replacing y by the simulated value. We have run the algorithm with 2, 3, 4 and
5 splits when branching a variable. The fitted conditional densities
are displayed in Figs. 3, 4, 5, and 6.
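
The two-step simulation described above can be reproduced with the standard library alone. The sketch below is an illustration; the function name `sample_pairs` and the seeds are arbitrary choices, not part of the original experiment.

```python
import math
import random

def sample_pairs(n, seed=0):
    """Draw n pairs (x, y): y from N(0, 1), then x from N(0.5 * y, sqrt(0.75))."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        x = rng.gauss(0.5 * y, math.sqrt(0.75))  # second argument is a standard deviation
        pairs.append((x, y))
    return pairs

def sample_covariance(pairs):
    """Empirical covariance between the two coordinates."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

data = sample_pairs(500)           # same sample size as in the experiment
large = sample_pairs(20000, seed=1)
```

For a large sample the empirical covariance of `large` should be close to 0.5 and the variance of X close to 1, in agreement with the covariance structure of the bivariate normal.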

It can be seen how the accuracy of the estimated density increases as the 
number of splits grows. 

6 Conclusions 

We have proposed a method for estimating conditional MTE densities from data.
This work, together with [8], allows the distributions in a Bayesian network,
either marginal or conditional, to be learnt. The model has been shown to be
adequate for fitting some well-known conditional models such as the conditional
normal distribution.
Fig. 3. Fitted density taking two splits per variable

Fig. 4. Fitted density taking three splits per variable

Fig. 5. Fitted density taking four splits per variable

Fig. 6. Fitted density taking five splits per variable
However, some work remains to be done. A thorough experimental evaluation
is necessary to check the practical value of the algorithm. The problem is
that experiments with models which involve many variables are difficult to report
(plots are not possible). We therefore plan to randomly generate mixed trees,
generate samples from them and estimate new mixed trees from the samples.
The goodness of fit can then be measured using the Kullback-Leibler divergence.

Furthermore, the performance of mixed trees during propagation is another
open issue. An important point in this framework is the pruning of the tree,
which allows the definition of approximate propagation algorithms, perhaps based
on importance sampling [12] or on ideas related to Penniless propagation [1].
Abstract. Model complexity is an important factor to consider when
selecting among graphical models. When all variables are observed, the 
complexity of a model can be measured by its standard dimension, i.e. the 
number of independent parameters. When latent variables are present, 
however, the standard dimension might no longer be appropriate. In- 
stead, an effective dimension should be used [5]. Zhang & Kocka [13] 
showed how to compute the effective dimensions of partially observed 
trees. In this paper we solve the same problem for partially observed 
polytrees. 



1 Introduction 

Learning graphical models from data has been widely studied in recent years. 
Two approaches have been developed. One approach builds models based on 
statistical independence tests. The other approach searches, in a space of models, 
the model that maximizes a certain scoring function. 

From the Bayesian perspective, a natural scoring function is the marginal
likelihood of the model given the data. In [4], Cooper and Herskovits gave a formula
for computing the Bayesian score in the case of complete data. At the same time, they
showed that exact computation of the score is intractable when latent variables
are present. In such cases asymptotic approximations of the marginal likelihood
such as the Bayesian Information Criterion (BIC) [10] and the Cheeseman-Stutz
Criterion (CS) [3] are usually employed.

The BIC score has two parts: one evaluates the fit of the model to the data 
and the other penalizes the model according to its complexity. The complexity 
of a model is measured by use of the standard dimension, i.e. the number of 
independent parameters. However, the standard dimension might prove incorrect
when latent variables are present. Consider the model $O \to H$ with two variables:
an observed variable O and a latent variable H. All the parameters in $P(H \mid O)$ are
irrelevant as they do not influence the fit of the model to the (observed) data.
Thus, there is no reason to penalize the model for such parameters.

Reexamining the derivation of the BIC score, Geiger et al. [5] concluded that 
the standard dimension should be replaced by the effective dimension. They also 
showed that the effective dimension of a model is the rank of the Jacobian matrix 
of the transformation between the parameters of the model and the parameters
of the distribution over the observed variables. 

Effective dimension is useful for several reasons. First, BIC with effective
dimension was shown in [5] to be an asymptotic approximation of the marginal
likelihood at regular points, although it was later shown not to be so at singular
points [9]. Second, the BIC and CS scores, when used together with standard
dimension, can easily be shown to be inconsistent model selection criteria. When
used with effective dimension, however, they are likely to be consistent, mainly
because of the close relationship between effective dimension and model inclusion
(see Lemma 1).¹ Third, effective dimension fits perfectly into the penalization
scheme in the AIC score [1]. Note that the AIC score has a quite different
objective than the marginal likelihood. Fourth, effective dimension can be used
to judge the identifiability of a model and its parameters. This approach
is used, for example, in mark-recovery and capture-recapture studies [2].

The straightforward method of computing effective dimension has an expo- 
nential complexity in the number of observed nodes. The main concern of this 
paper is how to compute effective dimensions efficiently. For partially observed 
trees, this problem was solved in [13], where a theorem allowing a decomposition 
of the problem into the same problem for a set of latent class models was proved. 
The effective dimensions of latent class models can be computed either using the 
tight upper bound developed in [6] or by the direct computation of the rank of 
the Jacobian matrix. See also [11] and [12] for interesting special cases. 

In this paper we present a solution for partially observed polytrees. In Sect. 2 
we introduce our notations, definitions, special classes of models and known re- 
sults concerning effective dimension. In Sect. 3 we show how to compute the 
effective dimension of a polytree consisting of a single latent node and its ob- 
served Markov boundary. We call such a polytree a primitive polytree and relate 
its effective dimension to some latent class model. We utilize a very special pa- 
rameterization to obtain this result. Section 4 utilizes a result by Zhang & Kocka 
[13] to decompose polytrees using some of their observed nodes. Moreover it 
shows how to decompose polytrees further using their latent nodes. Thus we 
decompose any polytree into a set of primitive polytrees. We end by concluding 
in Sect. 5. 



2 Basic Concepts 

In this section we review basic concepts of graphs, graphical models and results 
concerning effective dimension of models with latent variables. 

2.1 Graphs 

A graph G is a pair $(N, E)$, where N is a set of nodes and E is a set of edges, i.e.
a subset of $N \times N$ of ordered pairs of distinct nodes. Each node $X \in N$, denoted
by an upper-case letter, represents a discrete random variable. We denote the
number of states of a variable X by $|X|$ and a particular state of a variable X
by a lower-case letter x. We often use a set of variables $R \subseteq N$ to represent a
joint variable over its elements which has number of states $|R| = \prod_{X \in R} |X|$.

¹ The marginal likelihood has not yet been shown to be consistent for models with
latent variables either.

An Acyclic Directed Graph (DAG) is a graph where all edges are directed and
there are no directed cycles. If a graph has a directed edge $A \to B$, then node A is
a parent of node B, i.e. $A \in Pa(B)$, and B is a child of A, i.e. $B \in Ch(A)$. The union
of a node's children and parents is called its neighbors, i.e. $Ne(A) = Pa(A) \cup Ch(A)$.
The union of parents, children and parents of children of a node is called the
Markov boundary, i.e. $Mb(A) = Pa(A) \cup Ch(A) \cup \bigcup_{Z \in Ch(A)} Pa(Z)$. A node A in
a DAG is d-separated by its Markov boundary $Mb(A)$ from all other nodes (see
[7] for the definition of d-separation).
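
The Markov boundary can be read directly off the parent sets of a DAG. The sketch below assumes the DAG is given as a dictionary from each node to the set of its parents; the function name and the small example graph are hypothetical. Note that the node itself, which belongs to the parent sets of its own children, is removed from the result.

```python
def markov_boundary(parents, node):
    """Parents, children and parents of children of `node`, excluding `node`.

    `parents` maps every node of the DAG to the set of its parents.
    """
    children = {c for c, ps in parents.items() if node in ps}
    boundary = set(parents[node]) | children
    for child in children:
        boundary |= set(parents[child])
    boundary.discard(node)  # node is a parent of its own children
    return boundary

# Hypothetical DAG: P -> H, H -> C, O -> C, H -> T, T -> Z.
dag = {"P": set(), "O": set(), "H": {"P"}, "C": {"H", "O"}, "T": {"H"}, "Z": {"T"}}
boundary_of_h = markov_boundary(dag, "H")
```

Here `boundary_of_h` consists of the parent P, the children C and T, and O, the other parent of C; the node Z is outside the boundary.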

2.2 Graphical Models 

A Bayesian network is a pair $(G, \theta_G)$ where G is a DAG and $\theta_G$ are parameters.
The parameters describe the conditional probability distribution $P(X \mid Pa(X))$
for each variable X given its parents $Pa(X)$. The standard dimension of a
Bayesian network model is $ds(G) = \sum_{X \in N} (|X| - 1) \prod_{Y \in Pa(X)} |Y|$, where
N is the set of all nodes.
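
As a small worked instance of the standard dimension formula, the following sketch sums $(|X| - 1) \prod_{Y \in Pa(X)} |Y|$ over all nodes. The function name and the example network are hypothetical.

```python
from math import prod

def standard_dimension(cardinality, parents):
    """Number of independent parameters of a Bayesian network.

    `cardinality` maps each node to its number of states and `parents`
    maps each node to a list of its parents.
    """
    return sum((cardinality[x] - 1) * prod(cardinality[p] for p in parents[x])
               for x in cardinality)

# Hypothetical network H -> A, H -> B with |H| = 2, |A| = 2, |B| = 3.
cardinality = {"H": 2, "A": 2, "B": 3}
parents = {"H": [], "A": ["H"], "B": ["H"]}
ds = standard_dimension(cardinality, parents)
```

For this network the terms are 1 for H, 1 · 2 for A and 2 · 2 for B, so `ds` is 7.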

A Bayesian network represents a joint probability distribution $P(N \mid G, \theta_G)$
via the factorization formula $P(N \mid G, \theta_G) = \prod_{X \in N} P(X \mid Pa(X))$. D-separation in
G implies a conditional independence w.r.t. the joint probability P. In particular,
any node A is independent of all other nodes given its Markov boundary.

A model is completely observed if all its nodes are observed. Otherwise it is
partially observed. The unobserved nodes are called latent nodes. A Bayesian
network model $M(G)$ is the set of all joint probability distributions over the
observed nodes that can be represented by any Bayesian network $(G, \theta_G)$.

We say that model $M_1$ includes model $M_2$ if for every parameterization $\theta_2$
of $M_2$ there exists a parameterization $\theta_1$ of $M_1$ such that $M_1$ and $M_2$ represent
the same joint probability distribution over observed variables. Two models $M_1$
and $M_2$ are said to be equivalent if $M_1$ includes $M_2$ and $M_2$ includes $M_1$. Note
that these definitions extend the standard ones by considering the possibility of
having both latent and observed variables.

A Bayesian network model whose DAG is a rooted tree is in this paper
referred to as a tree model or simply a tree. A latent class (LC) model is a
special tree model that consists of one latent node and a number of observed
nodes. In a tree model, each latent node and its neighbors form an LC model.

In a rooted tree, each node has at most one parent. In a polytree, a node
may have multiple parents and there are no cycles. A polytree model or simply
a polytree is a Bayesian network model whose DAG is a polytree. A primitive
polytree (PP) is a polytree with one latent node H and a number of observed
nodes consisting of the parents of H, the children of H, and the parents of the
children of H. In a polytree, each latent node together with its Markov boundary
forms a primitive polytree. A compact polytree (CP) model is a polytree where
each observed node has either no children or just one child and no parents. Each
latent node in a CP model induces, together with its Markov boundary, a PP
model. Examples of LC, PP, and CP models are shown in Fig. 1.

Fig. 1. Example of a) latent class, b) primitive polytree and c) compact polytree models

Note that all LC models are PP models and all PP models are CP models. 
This hierarchy of classes of models plays an important role in this paper. 

A tree model is regular if for each latent node H holds \H\ < ^ 

Each irregular tree is equivalent to some regular tree, which can be obtained 
via a simple regularization process reducing the cardinality of the latent nodes 
concerned [6]. Thus, by computing the effective dimension for all regular trees 
one solves the problem for all trees. 

2.3 Effective Dimension 

In a (partially observed) graphical model G, the joint probability distribution
$P(O)$ over the observed variables O depends on the parameters $\theta_G$ of the
model. It can be viewed as a transformation from the parameters to the vector
$(P(o_1), P(o_2), \ldots)$, where $o_j$ is a combination of the values of the observed
variables. As the parameters vary, the vector spans a subspace of a Euclidean
space. The dimension of this subspace is defined to be the effective dimension
of the model G [5]. We denote it by $de(G)$. The following lemma is obvious.

Lemma 1. Let $M_1$ and $M_2$ be two graphical models having the same set of
observed variables. If $M_1$ includes $M_2$ then $de(M_1) \ge de(M_2)$.

We denote by $J_O(\theta_G) = [J_{jk}] = [\partial P(o_j) / \partial \theta_k]$ the Jacobian matrix of the
aforementioned transformation. Rows of $J_O(\theta_G)$ correspond to states in the observed
space O of the model G, columns to the parameters $\theta_G$. Geiger et al. [5] showed
that the effective dimension $de(G)$ of a model G is the rank of $J_O(\theta_G)$.

The rank of a matrix is the number of (row or column) vectors in a basis of 
the matrix. A basis is a set of linearly independent vectors such that all other 
vectors can be expressed as a linear combination of the vectors in the basis. Note 
that for any set of independent vectors there is always a basis which includes 
this set. 

The rank of $J_O(\theta_G)$ is in general a function of $\theta_G$ but Geiger et al. [5] showed
that it is constant almost everywhere, except on a set of singular points having
zero measure, under the assumption that the model is parameterized in such a
way that the joint probability distribution it represents is a polynomial function
of the parameters. Therefore, two models $M$ and $M^*$ having the same parameters
and model equation, where the parameters of $M$ are subject to some additional
inequality constraints compared to $M^*$, have the same effective dimension if the
constrained parameters of $M$ form a set of positive measure in the space of
parameters of $M^*$.

Geiger et al. [5] suggest the following numerical approach to compute the
effective dimension of a model: generate a random $\theta$, compute the Jacobian and
its rank with sufficient numerical precision. We used this algorithm, implemented
in Matlab by Rusakov [8], to study the effective dimensions of some polytrees
empirically.
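
The numerical approach can be illustrated for latent class models. The sketch below is not the Matlab implementation mentioned above: it swaps floating-point rank computation for exact rational arithmetic, derives the Jacobian analytically for an LC model whose parameters are the free entries of $P(H)$ and of each $P(O_i \mid H)$, and evaluates it at one randomly drawn rational parameter point. All function names are hypothetical.

```python
import itertools
import random
from fractions import Fraction

def random_distribution(n, rng):
    """Random strictly positive rational distribution over n states."""
    weights = [Fraction(rng.randint(1, 1000)) for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def product_except(cond, obs, h, skip=None):
    """prod_i P(O_i = obs_i | H = h), optionally leaving out factor `skip`."""
    value = Fraction(1)
    for i, s in enumerate(obs):
        if i != skip:
            value *= cond[i][h][s]
    return value

def lc_jacobian(latent_states, observed_states, rng):
    """Jacobian of the observed joint w.r.t. the free parameters of an LC model."""
    p_h = random_distribution(latent_states, rng)
    # cond[i][h][s] = P(O_i = s | H = h)
    cond = [[random_distribution(n, rng) for _ in range(latent_states)]
            for n in observed_states]
    last_h = latent_states - 1
    rows = []
    for obs in itertools.product(*(range(n) for n in observed_states)):
        # Derivatives w.r.t. P(H = h); P(H = last_h) is 1 minus the others.
        row = [product_except(cond, obs, h) - product_except(cond, obs, last_h)
               for h in range(last_h)]
        # Derivatives w.r.t. P(O_i = s | H = h); the last state of O_i is eliminated.
        for i, n in enumerate(observed_states):
            for h in range(latent_states):
                for s in range(n - 1):
                    sign = int(obs[i] == s) - int(obs[i] == n - 1)
                    row.append(sign * p_h[h] * product_except(cond, obs, h, skip=i))
        rows.append(row)
    return rows

def rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def effective_dimension(latent_states, observed_states, seed=0):
    """Rank of the Jacobian at one random parameter point (generic almost surely)."""
    return rank(lc_jacobian(latent_states, observed_states, random.Random(seed)))
```

For a binary latent node with two binary observed children the standard dimension is 5, while `effective_dimension(2, [2, 2])` is bounded by the 3 free parameters of the observed joint distribution. Because the rank is constant only almost everywhere, a single random point gives the effective dimension with probability one rather than with certainty.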

Settimi & Smith [11] solved the effective dimension of latent class models 
with two observed nodes. Kocka & Zhang [6] derived a tight upper bound on the 
effective dimension of any latent class model. 

The following theorem takes advantage of the above solution(s) for LC models 
and solves the problem of the effective dimension of a tree with latent variables. 

Theorem 1. [13] Let $M$ be a regular partially observed tree model. Let $M_i$ be
the local LC model induced by a latent node $H_i$. Then the difference between the
standard and effective dimensions of $M$ equals the sum of the same differences
over all the local models, i.e. $d_s(M) - d_e(M) = \sum_i (d_s(M_i) - d_e(M_i))$,
where the summation is over all latent nodes in the model.

Moreover the following straightforward theorem proves essential in the next 
two sections. 

Theorem 2. [13] Let $M$ be a graphical model over observed variables $O$ and
latent variables $H$. Let $S \subseteq O$ be a subset of the observed nodes such
that there exist two nonempty sets of variables $V_1$ and $V_2$ where
$V_1 \cap S = \emptyset$, $V_2 \cap S = \emptyset$, $V_1 \cup V_2 \cup S = O \cup H$
and $V_1 \perp V_2 \mid S$ is true for any distribution encoded by the model $M$.
Let $M_0$, $M_1$ and $M_2$ be the sub models induced in $M$ by the sets $S$,
$V_1 \cup S$ and $V_2 \cup S$. Then $d_e(M) = d_e(M_1) + d_e(M_2) - d_s(M_0)$.



3 Effective Dimension of Primitive Polytrees 

In this section, we prove a theorem that relates the effective dimension of a PP
model to that of an LC model. Consider the PP model in Fig. 2 (a). Denote it by
$M$. Construct an LC model with a structure as shown in Fig. 2 (c) and where the
number of states of variable $Y$ is the product $|P_1||P_2|$ of those of variables
$P_1$ and $P_2$, that of $X_1$ is $1 + (|C_1| - 1)|O_1|$, and that of $X_2$ is
$1 + (|C_2| - 1)|O_2|$. Denote the LC model by $M_{LC}$. According to Theorem 3,
$d_e(M) = d_e(M_{LC}) + \sum_{i=1}^{2}(|O_i| - 1) + \sum_{i=1}^{2}(|P_i| - 1) + 1 - |P_1||P_2|$.



Theorem 3. Let $M$ be a primitive polytree model. Let $H$ be the unique latent
node; $P_i$ ($i = 1, \ldots, L$) be the parents of $H$; $T_r$ ($r = 1, \ldots, R$)
be those children of $H$ that have only one parent, namely $H$; and $C_j$
($j = 1, \ldots, J$) be those children of $H$ that have more than one parent. For
each $j$, let $O_{kj}$ ($k = 1, \ldots, K_j$) be the observed parents of $C_j$.
Let $M_{LC}$ be a latent class model with one latent variable $H$ and observed
variables $Y$, $T_r$ ($r = 1, \ldots, R$), and $X_j$ ($j = 1, \ldots, J$) where

$$|Y| = \prod_{i=1}^{L} |P_i|, \qquad |X_j| = 1 + (|C_j| - 1) \prod_{k=1}^{K_j} |O_{kj}| \quad (j = 1, \ldots, J).$$

Then $d_e(M) = d_e(M_{LC}) + \sum_{j,k}(|O_{kj}| - 1) + \sum_i (|P_i| - 1) + 1 - \prod_i |P_i|$.

Fig. 2. a) Primitive polytree model $M$, b) model $M_X$ with the deterministic
nodes $X$ and $Y$ introduced and c) latent class model $M_{LC}$



Proof. We prove the theorem in two steps. First, we introduce a graphical model
$M_X$ with a special parameterization and show that $d_e(M) = d_e(M_X)$. Second,
we use the latent class model $M_{LC}$ and show that
$d_e(M_X) = d_e(M_{LC}) + 1 + \sum_{j,k}(|O_{kj}| - 1) + \sum_i (|P_i| - 1) - \prod_i |P_i|$.
We sometimes denote by $O_j$ the Cartesian product over all $O_{kj}$ and
similarly by $P$ the same over all $P_i$.

The model $M_X$ is obtained from the model $M$ by the introduction of a new
latent variable $Y$ and new latent variables $X_j$ for each node $C_j$. The
parameters $P(Y|P)$ are fixed in such a deterministic way that there is a
one-to-one correspondence between the values of $Y$ and the values of the
Cartesian product of all $P_i$. The parameters of each $P(C_j|X_j, O_j)$ are fixed
in a deterministic way, too. Thus, these "parameters" are in fact not parameters
of the model $M_X$. We denote each state of $X_j$ (except one state) by a pair of
numbers $(c^*, o^*)$ where $c^* \in \langle 1, |C_j| - 1 \rangle$ and
$o^* \in \langle 1, |O_j| \rangle$. The last state of $X_j$ is denoted by a number
$c' = |C_j|$. The four states of the node $X_j$ in the example in Table 1 are
$\{(1,1), (1,2), (1,3), 2\}$ in this notation. We set $P(C_j|X_j, O_j)$ in this
way: $p(C_j = c | X_j = c', O_j) = 1$ if $c = c'$; and
$p(C_j = c | X_j = (c^*, o^*), O_j = o) = 1$ if (($c = c^*$ and $o = o^*$) or
($c = c'$ and $o \neq o^*$)). All other probabilities in $P(C_j|X_j, O_j)$ are
zero. An example of such a distribution is in Table 1.



Table 1. Example of the deterministic distribution $P(C_j|X_j, O_j)$ in the model
$M_X$ for $|O_j| = 3$, $|C_j| = 2$ and thus $|X_j| = 4$

X_j =      (1,1)      (1,2)      (1,3)       (2)
O_j =     1  2  3    1  2  3    1  2  3    1  2  3
C_j = 1   1  0  0    0  1  0    0  0  1    0  0  0
C_j = 2   0  1  1    1  0  1    1  1  0    1  1  1
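The deterministic distribution $P(C_j|X_j, O_j)$ defined above can be generated mechanically from $|C_j|$ and $|O_j|$. The following is a minimal sketch, assuming states are numbered from 1 as in the text; the function names `x_states` and `deterministic_cpt` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def x_states(n_c, n_o):
    """States of X_j: pairs (c*, o*) with c* in 1..|C_j|-1 and o* in 1..|O_j|,
    followed by the single last state c' = |C_j|."""
    pairs = [(c, o) for c in range(1, n_c) for o in range(1, n_o + 1)]
    return pairs + [n_c]


def deterministic_cpt(n_c, n_o):
    """Return {(x, o): {c: probability}} for P(C_j | X_j, O_j)."""
    last = n_c  # the state c' = |C_j|
    cpt = {}
    for x, o in product(x_states(n_c, n_o), range(1, n_o + 1)):
        if x == last:
            chosen = last                      # X_j = c' forces C_j = c'
        else:
            c_star, o_star = x
            # C_j = c* only when O_j matches o*, otherwise C_j = c'
            chosen = c_star if o == o_star else last
        cpt[(x, o)] = {c: int(c == chosen) for c in range(1, n_c + 1)}
    return cpt


# Reproduce the setting of Table 1: |C_j| = 2, |O_j| = 3, hence |X_j| = 4.
cpt = deterministic_cpt(2, 3)
row_c1 = [cpt[(x, o)][1] for x in x_states(2, 3) for o in (1, 2, 3)]
row_c2 = [cpt[(x, o)][2] for x in x_states(2, 3) for o in (1, 2, 3)]
```

Reading `row_c1` and `row_c2` left to right follows the column order of Table 1 (the states of $X_j$, each split by the three states of $O_j$).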
Consider a joint probability distribution over observed variables represented
by the model $M$. If we try to represent the same distribution by the model
$M_X$, we find that a transformation from the parameters describing
$P(C_j|H, O_j)$ to parameters describing $P(X_j|H)$ always exists; however, it can
yield a $P(X_j|H)$ which is not a probability distribution (some values are lower
than zero or even higher than one, but note that any parameterization always
ensures that the sum given any condition is equal to one). Thus, the model $M_X$
doesn't include the model $M$. But let us consider a model $M_X^*$ having the same
parameters and model equation as the model $M_X$ but relaxing all the constraints
which the model $M_X$ puts on its parameters, except the constraint that the model
$M_X^*$ represents a joint probability distribution over the observed variables.
Note that $M_X^*$ is not a graphical model and $P(X_j|H)$ is not necessarily a
probability distribution. Obviously, $M_X^*$ includes $M$, $M_X^*$ includes $M_X$
and moreover the parameters of the model $M_X$ form a set with a positive measure
in the space of the parameters of the model $M_X^*$. Thus, the effective dimension
of the two models $M_X$ and $M_X^*$ is the same.

Because by marginalizing the variables $X$ and $Y$ out of the graphical model
$M_X$ one obtains the graphical model $M$, it follows that the model $M$ includes
the model $M_X$. From the facts that $M_X^*$ includes $M$, $M$ includes $M_X$ and
$d_e(M_X) = d_e(M_X^*)$, and from Lemma 1, it follows that $d_e(M) = d_e(M_X)$.
Thus, the first claim is proved.

Note that the nodes X and Y can be introduced in any polytree for any 
latent variable and all the arguments above apply to such a case, too. 

Back to the model $M_X$. The fixed deterministic distribution $P(C_j|X_j, O_j)$
has the property that as long as the marginal probability $P(O_j)$ is positive
then for every state $(c^*, o^*)$ of $X_j$ there exists a state $c^*$ of $C_j$ and
a state $o^*$ of $O_j$ such that $p(X_j = (c^*, o^*) | C_j = c^*, O_j = o^*) = 1$,
i.e. $p(X_j = (c^*, o^*), C_j = c^*, O_j = o^*) = p(C_j = c^*, O_j = o^*)$. We have
defined the distribution $P(C_j|X_j, O_j)$ including the condition
$p(C_j = c^* | X_j = (c^*, o^*), O_j = o^*) = 1$, i.e.
$p(C_j = c^*, X_j = (c^*, o^*), O_j = o^*) = p(X_j = (c^*, o^*), O_j = o^*)$.
Moreover note that $X_j$ and $O_j$ are marginally independent and thus
$p(X_j = (c^*, o^*), O_j = o^*) = p(X_j = (c^*, o^*)) \cdot p(O_j = o^*)$. All
these equations together imply that
$p(X_j = (c^*, o^*)) = p(C_j = c^*, O_j = o^*) / p(O_j = o^*)$. We denote by $B_j$
any set of nodes in the model $M_X$ except the nodes $X_j$, $C_j$ and $O_j$. Note
that $B_j$ can be for example the set of all such observed nodes. Then, from the
distribution $P(C_j, O_j, B_j)$ one can easily compute the distribution
$P(X_j, B_j)$ and
$P(X_j, C_j, O_j, B_j) = P(X_j, B_j) \cdot P(O_j) \cdot P(C_j|X_j, O_j)$ as well.
Thus, the special fixed distribution $P(C_j|X_j, O_j)$ of $M_X$ defined above
causes the nodes $X$ to be de facto observed, too.
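The identity $p(X_j = (c^*, o^*)) = p(C_j = c^*, O_j = o^*) / p(O_j = o^*)$ can be checked on a small example. The sketch below is illustrative only: the function names and the particular distributions are assumptions made for this example, and exact fractions are used so that the recovered distribution can be compared for equality.

```python
from fractions import Fraction


def chosen_c(x, o, n_c):
    """Value of C_j forced by X_j = x and O_j = o under the deterministic CPT."""
    if x == n_c:                       # last state c' = |C_j|
        return n_c
    c_star, o_star = x
    return c_star if o == o_star else n_c


def joint_c_o(p_x, p_o, n_c):
    """P(C_j, O_j) after summing X_j out; X_j and O_j are marginally independent."""
    joint = {}
    for x, px in p_x.items():
        for o, po in p_o.items():
            key = (chosen_c(x, o, n_c), o)
            joint[key] = joint.get(key, 0) + px * po
    return joint


def recover_p_x(joint, p_o, n_c):
    """Recover P(X_j) from the observed quantities P(C_j, O_j) and P(O_j) only."""
    recovered = {}
    for o_star, po in p_o.items():
        for c_star in range(1, n_c):
            recovered[(c_star, o_star)] = joint.get((c_star, o_star), 0) / po
    recovered[n_c] = 1 - sum(recovered.values())   # remaining mass: last state
    return recovered


# Hypothetical numbers for |C_j| = 2 and |O_j| = 3.
p_x = {(1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
       (1, 3): Fraction(3, 10), 2: Fraction(4, 10)}
p_o = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
recovered = recover_p_x(joint_c_o(p_x, p_o, 2), p_o, 2)
```

Here `recovered` is computed without looking at `p_x`, which is the sense in which the node $X_j$ behaves as if it were observed.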

Because the nodes $X$ are observed, we can apply Theorem 2 and we obtain
$d_e(M_X) = d_e(M'_{LC}) + \sum_{j,k}(|O_{kj}| - 1)$, where $M'_{LC}$ is the sub
model induced from the model $M_X$ by the latent nodes $H$ and $Y$ and observed
nodes $X$ and $T$. Because of the special parameterization of $P(Y|P)$, this model
is equivalent to the latent class model $M_{LC}$ with the special requirement that
the marginal distribution $P(Y)$ has to correspond to the mutually marginally
independent nodes $P_i$. The question is whether all these
$|Y| - 1 - \sum_i (|P_i| - 1)$ independent equality constraints decrease the
effective dimension of the model $M'_{LC}$ compared to the model $M_{LC}$. Because
all the parameters characterizing the observed marginal $P(Y)$ in $M_{LC}$ are
independent and one can choose a basis of the Jacobian matrix of model $M_{LC}$ to
contain all the corresponding vectors, each of the independent constraints
reduces the basis by one vector. Thus, we have shown that
$d_e(M_X) = d_e(M_{LC}) + \sum_{j,k}(|O_{kj}| - 1) - (\prod_i |P_i| - 1) + \sum_i (|P_i| - 1)$.
Q.E.D

Fig. 3. Examples a) W structure in [5] with $|O_i| = 2$, b) Compact polytree
model with $|O_1| = |O_2| = |O_3| = |O_4| = |O_5| = |H_3| = 2$ and
$|O_6| = |H_1| = |H_2| = 3$

We say that a polytree $M$ is reduced if for every latent node $H_i$ in $M$ after
the addition of the $X$ and $Y$ nodes around $H_i$ (as in the proof above) the
latent class model induced by the Markov boundary of $H_i$ is regular. If some
polytree model is not reduced, we can reduce it by decreasing the cardinality of
the appropriate node $H_i$ to satisfy the regularity constraint.

Suppose we have a non-reduced polytree model $M$ and denote by $M_R$ the model
obtained by the reduction process described above. Then the two models $M$ and
$M_R$ have the same effective dimension. We show this by showing that it holds
for a single step of the reduction process decreasing the cardinality of $H_i$.
Thus, assume that only one step was needed to reduce $M$. Denote by $M^*$ and
$M_R^*$ the models obtained from $M$ and $M_R$ by adding the nodes $X$ and $Y$.
The first part of the proof of Theorem 3 applies to any polytree because the node
$H_i$ is d-separated from all other nodes by its Markov boundary and it is
exactly this boundary to which the proof applies. Thus $d_e(M^*) = d_e(M)$ and
$d_e(M_R^*) = d_e(M_R)$. Now, in the two models $M^*$ and $M_R^*$ the node $H_i$
has different cardinality but the same Markov boundary, which forms a latent
class model. Again, using the d-separation of $H_i$ from all other nodes given its
Markov boundary and the fact that the two latent class models are equivalent, it
follows that the two models $M^*$ and $M_R^*$ are equivalent, too. Thus, by
Lemma 1, they have the same effective dimension, too.

We can demonstrate the use of Theorem 3 on the W structure reported in [5].
The W structure consists of one latent node $H$ and two binary observed children,
each of them having one extra binary observed parent (see Fig. 3). It was reported
in [5] that this structure has $d_e = 9$ for $|H| = 2$, $d_e = 10$ for $|H| = 3$,
$d_e = 10$ for $|H| = 4$ and $d_e = 11$ for $|H| = 5$. This was later corrected to
$d_e = 10$ for $|H| = 5$. However, no explanation of these results was available
up to now. We can apply Theorem 3, which converts the problem into an LC problem
with one latent variable $H$ and two observed variables with three states. For
these, we can use the exact solution from [11]. The result is $d_e = 9$ for
$|H| = 2$ and $d_e = 10$ for $|H| \geq 3$, as all the LC models with $|H| \geq 3$
are equivalent.
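The W structure computation can be retraced in a few lines. The sketch below is an illustration under stated assumptions, not the solution of [11]: for the LC model with two observed variables it uses the standard count of the dimension of $m \times n$ matrices of rank at most $|H|$, minus one for normalization, and the function names are hypothetical. The correction terms follow the formula of Theorem 3, with an empty set of parents of $H$.

```python
from math import prod


def lc_two_variable_dimension(h, m, n):
    """ASSUMPTION: effective dimension of an LC model with two observed variables,
    taken as the dimension of m x n matrices of rank <= h minus one."""
    r = min(h, m, n)
    return r * (m + n - r) - 1


def theorem3_dimension(d_lc, observed_parent_cards, latent_parent_cards):
    """d_e(M) = d_e(M_LC) + sum(|O_kj| - 1) + sum(|P_i| - 1) + 1 - prod(|P_i|)."""
    return (d_lc
            + sum(o - 1 for o in observed_parent_cards)
            + sum(p - 1 for p in latent_parent_cards)
            + 1 - prod(latent_parent_cards))   # empty product equals 1


def w_structure_dimension(h):
    """W structure: two binary children, each with one extra binary observed parent."""
    x_card = 1 + (2 - 1) * 2                   # |X_j| = 1 + (|C_j| - 1)|O_j| = 3
    d_lc = lc_two_variable_dimension(h, x_card, x_card)
    return theorem3_dimension(d_lc, [2, 2], [])


dimensions = {h: w_structure_dimension(h) for h in (2, 3, 4, 5)}
```

Under these assumptions `dimensions` contains 9 for $|H| = 2$ and 10 for larger $|H|$, because the rank bound saturates once $|H|$ reaches the three states of the $X_j$ variables.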

4 Decomposition of Polytrees

It is easy to realize that by applying Theorem 2 to any polytree, one decomposes 
it into a set of compact polytrees. The sets S used for this decomposition cor- 
respond to single observed nodes that have either more than one child or some 
parents as well as a child. 

In this section, we show how to compute the effective dimension of any re- 
duced compact polytree by decomposing it into a set of reduced primitive poly- 
trees. It was explained in the previous section that we can limit ourselves to 
reduced polytrees because any non-reduced polytree can be easily converted 
into a reduced one with the same effective dimension. The following theorem 
states the main result of this section, which we prove in the rest of this section. 

Theorem 4. Let $M$ be a reduced compact polytree with observed nodes $O$ and
latent nodes $H$. Let $M_i$ denote the reduced primitive polytree model induced in
$M$ by any latent node $H_i \in H$ and its Markov boundary $Mb(H_i)$ in $M$. Then
$d_e(M) = d_s(M) - \sum_i (d_s(M_i) - d_e(M_i))$.

Proof. We prove this theorem in three steps. First, we prove a lemma
characterizing what a compact polytree having more than a single latent node
looks like. Second, we prove a lemma describing a special parameterization of
parts of reduced compact polytrees and its properties. Third, we prove a lemma
enabling a decomposition of any reduced compact polytree into two reduced
compact polytrees, each having fewer latent nodes than the original one. This
lemma builds upon the two previous ones and directly proves the theorem above
because the repeated decomposition ends with a set of reduced primitive
polytrees. Q.E.D



Lemma 2. Let $M$ be a compact polytree model having more than a single latent
node. For any latent node $H_1$ there is a latent node $H_2$ in $M$ such that
$H_1$ and $H_2$ are either neighbors or both parents of an observed node $O$ in
$M$.

Proof. $M$ is a polytree, thus there is a unique path between any two nodes.
Choose $H_2$ to be a latent node in $M$ such that the path from $H_1$ to $H_2$ in
$M$ doesn't contain any other latent node. The path can thus contain only
observed nodes or no node at all (except $H_1$ and $H_2$). Every observed node in
the path has at least two neighbors. This is possible in a compact polytree only
if all its neighbors are its parents. Thus, there cannot be more than a single
observed node in the path and the lemma is proved. Q.E.D

We define a sub polytree at a node $A$ away from nodes $B$ in a polytree $M$
with nodes $N$ as the subgraph of $M$ induced by all nodes $C \in N$ such that the
path from $A$ to $C$ doesn't contain any node from the set $B$.

Lemma 3. Let $M$ be a reduced compact polytree model having nodes
$N = H \cup O$, where $H$ are latent nodes and $O$ are observed. Let $M_U$ be a
sub polytree of $M$ at a latent node $H_i \in H$ away from a node $C \in Ch(H_i)$
consisting of nodes $U$. Let $M_W$ be a sub polytree of $M$ at a latent node
$H_i \in H$ away from the nodes $Pa(H_i)$ consisting of nodes $W$. Then the sub
model $M_U$ can be parameterized in such a way that $P(O)$ determines $P(O, H_i)$
and $P(H_i)$ can be chosen to be a positive distribution. Moreover the sub model
$M_W$ can be parameterized in such a way that $P(O)$ determines $P(O, H_i)$ and
$P(H_i | (O \setminus W))$ can be any distribution.

Proof. We present a sketch of the proof only. The proof is done by induction
over the number of latent nodes in model $M$. First, consider a single latent
node. We can introduce the $X$ and $Y$ nodes from the proof of Theorem 3. Because
$M$ is reduced, the induced latent class model is regular and $M_W$ can be
parameterized to encode a bijection between the states of $H_i$ and the Cartesian
product of all $X$ nodes. For $M_U$ one can encode a similar bijection to all $X$
nodes but one and the states of $Y$, which are restricted to distributions
satisfying the marginal independence among $Pa(H_i)$. We have already seen that
the rest of the polytree can be parameterized to make the $X$ and $Y$ nodes de
facto observed, and we note that a positive distribution satisfying the marginal
independence is always possible. The nodes $X$ and $Y$ can be marginalized out and
we obtain the parameterization needed for the model $M$, which proves the base
case of the induction. The induction step again uses a latent node $H_i$ and the
nodes $X$ and $Y$ around it. But the $Pa(H_i)$, $Ch(H_i)$ and
$Pa(Ch(H_i)) \setminus H_i$ in $M$ can be latent nodes now. For $Pa(H_i)$ we use
the induction hypothesis for sub polytrees away from the node $H_i$, for $Ch(H_i)$
we use the sub polytrees away from their parents and for
$Pa(Ch(H_i)) \setminus H_i$ we use the sub polytree away from $Ch(H_i)$,
respectively the $C$ nodes. Note that for both $Pa(H_i)$ and
$Pa(Ch(H_i)) \setminus H_i$ any positive marginal distribution is sufficient,
while for $Ch(H_i)$ one needs to be able to encode any distribution as needed,
which is possible by the induction hypothesis. This finishes the induction step
and thus the whole proof. Q.E.D



Lemma 4. Let $M$ be a reduced compact polytree model having nodes
$N = H \cup O$, where $H$ are latent nodes and $O$ are observed. Then there is a
latent node $S \in H$, its child $T \in Ch(S)$, observed parents of the child
$O_0 \subseteq O \cap Pa(T)$ and other latent parents of the child
$R \subseteq H \cap (Pa(T) \setminus \{S\})$ in $M$ where
$H \cap (\{T\} \cup R) \neq \emptyset$. The nodes $S$, $T$, $R$ and $O_0$ induce in
$M$ a sub model $M_0$ with all nodes observed. The sub polytree of $M$ at the node
$S$ away from $T$ consists of nodes $N_S$. The nodes $N_S \cup \{S\}$, $T$, $R$
and $O_0$ induce in $M$ a sub model $M_1$ with the nodes $T$ and $R$ observed. The
nodes $(N \setminus N_S) \cup \{S\}$ induce in $M$ a sub model $M_2$ with the node
$S$ observed. For the effective dimensions of these models it holds that
$d_e(M) = d_e(M_1) + d_e(M_2) - d_s(M_0)$.

Proof. We present only a sketch of the proof due to the page limit. From Lemma 2
it follows that there exist either latent nodes $S$ and $T$, or latent nodes $S$
and $R_i \in R$ having a common observed child $T$. We consider the first case
only, which may contain latent nodes $R$, too. The same proof applies to the
second case; it is just simpler because node $T$ is observed. Moreover, for
simplicity we consider only a single node $R$; all $R_i \in R$ can be dealt with
in the same way.

The situation is depicted in Fig. 4. We denote by $J$ the Jacobian matrix of
the polytree model $M$ and similarly use $J_1$ and $J_2$ for $M_1$ and $M_2$.
Moreover, we denote by $\theta_0$, $\theta_t$, $\theta_r$ and $\theta_s$ the
marginal parameters of $O_0$, $T$, $R$ and $S$ and by $\theta_{tt}$,
$\theta_{rr}$ and $\theta_{ss}$ the parameters of the sub polytrees at $T$, $R$
and $S$ except for $\theta_t$, $\theta_r$ and $\theta_s$.

Fig. 4. Compact polytree model $M$ and its induced sub-models

The columns of $J_2$ corresponding to the parameters $\theta_{0,t,s}$ are
independent because the variables are either observed or can be observed and
encode any distribution if the special parameterization of $\theta_{tt}$ from
Lemma 3 is used. Thus, there is a basis $B_2$ of $J_2$ which contains these and as
many columns corresponding to $\theta_r$ as possible. Similarly, we denote by
$B_1$ the basis of $J_1$ which contains all the columns $\theta_{0,t,r}$ and as
many $\theta_s$ as possible. Obviously, $B_0$ contains all the columns
$\theta_{0,t,r,s}$. Let
$B = (B_1 \setminus B_0) \cup (B_2 \setminus B_0) \cup (B_1 \cap B_2)$.

All vectors in $J$ depend on the vectors in $B$ because $\theta_{ss,s}$ depend on
$B_1 \setminus \theta_r$ in $M_1$, $\theta_{rr,r,tt,0,t}$ on
$B_2 \setminus \theta_s$ in $M_2$ and these dependencies imply dependence in $B$
because of the d-separations. The fact that all vectors in $B$ are independent is
proved by contradiction. If there is a dependence then it has to hold even with
the special parameterization of $\theta_{ss,rr,tt}$ using Lemma 3 and this leads
to a dependence in $B_0$, which contradicts the fact of $B_0$ being a basis. Thus,
$B$ is a basis of $J$. From
$B = (B_1 \setminus B_0) \cup (B_2 \setminus B_0) \cup (B_1 \cap B_2)$,
$\theta_{0,t,r} \subseteq B_1$, $\theta_{0,t,s} \subseteq B_2$ and
$B_0 = \theta_{0,t,r,s}$ it follows that $|B| = |B_1| + |B_2| - |B_0|$. Q.E.D

We can demonstrate the use of Theorem 4 on the reduced compact polytree 
in Fig. 3. This model has ds = 41. It has three latent nodes and they induce 
three reduced primitive polytree models with ds = 23, ds = 28 and ds = 20. 
These primitive polytree models have de = 17, de = 22 and de = 18. Thus the 
compact polytree model in Fig. 3 has de = 27. 
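The arithmetic behind this example follows directly from the formula of Theorem 4, $d_e(M) = d_s(M) - \sum_i (d_s(M_i) - d_e(M_i))$. A minimal sketch, with a hypothetical function name, using only the numbers quoted above:

```python
def theorem4_effective_dimension(ds_model, local_dimensions):
    """Subtract, for every latent node, the gap between the standard and the
    effective dimension of its induced primitive polytree model."""
    return ds_model - sum(ds - de for ds, de in local_dimensions)


# (d_s, d_e) of the three primitive polytree models quoted in the text.
local_models = [(23, 17), (28, 22), (20, 18)]
d_e = theorem4_effective_dimension(41, local_models)
```

The three local gaps are 6, 6 and 2, so 14 is subtracted from the standard dimension 41.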

As mentioned in the introduction, effective dimensions can, in theory, be
computed from some Jacobian matrices whose sizes are exponential in the number
of observed variables. This straightforward method nonetheless turned out to be
viable in this example due to its small size: it took 34 seconds on a PC. In
contrast, the use of Theorem 4 enabled us to complete the computation in 1.5
seconds.

5 Conclusion 

In this paper, we present two important results concerning the computation of
effective dimensions of graphical models with latent variables. The first result
enables us to compute the effective dimension of primitive polytrees. It
transforms the problem into the same problem for some latent class model. The
second result enables us to decompose a polytree model into primitive sub models
and obtain the effective dimension of the model from those of the sub models.
This makes it feasible to compute the effective dimension of large polytree
models.



Applying Numerical Trees to Evaluate Asymmetric Decision Problems

Abstract. This paper describes some ideas for applying numerical
trees in order to represent and solve asymmetric decision problems
with influence diagrams (IDs). Constraint rules are used to represent
the asymmetries between the variables of the ID. These rules will be
transformed into numerical trees during the evaluation of the ID. The
application of numerical trees can reduce the number of operations
required to evaluate the ID. The paper also presents how numerical trees
may be approximated, thereby enabling complex decision problems to
be evaluated.

Keywords: Influence diagrams, asymmetric decision problems, numerical trees,
probability trees

T.D. Nielsen and N.L. Zhang (Eds.): ECSQARU 2003, LNAI 2711, pp. 196-207, 2003.
© Springer-Verlag Berlin Heidelberg 2003



1 Introduction

Asymmetric decision problems under uncertainty have traditionally been
represented and solved using Decision Trees. This tool is easy to understand and
solve, and it encodes the asymmetries without introducing dummy states for
the variables. Its main problem is the exponential growth of the representation.
Influence diagrams (IDs) have also been applied to represent and solve decision
problems. The power of an influence diagram, both as an analysis tool and
as a communication tool, lies in its ability to concisely and precisely describe
the structure of decision problems [20]. An ID encodes the independence
relations between variables, thereby avoiding the exponential growth of the
Decision Tree representation. The ID representation has weaknesses, the most
serious being the difficulty of dealing with highly asymmetric decision problems,
where particular acts or events lead to different possibilities [1]. In order to
represent an asymmetric decision problem as an ID, the problem must be
symmetrized by adding artificial states and assuming degenerate probability
distributions and/or value functions. These adaptations obscure the structure of
the problem and, most importantly, increase the time and space required for
solution. Several attempts have been made to overcome this drawback. Call and
Miller [13], Fung and Shachter [9], Smith et al. [20], Qi et al. [16], Covaliu and
Oliver [7], Shenoy [19], Nielsen and Jensen [14], and Demirer and Shenoy [8] have
proposed modifications to the influence diagram technique in order to deal with
asymmetric decision problems. In this paper, we follow a technique similar to
Smith's distribution trees [20] to represent the asymmetries between the
variables. We use a single tool (numerical trees) to encode all the knowledge
about the decision problem: asymmetries, probability distributions and utility
functions. The main difference between Smith's technique and ours is that we are
able to carry out approximate operations with numerical trees.

In any case, for complex decision problems, the evaluation of an ID becomes
impossible due to its computational cost: the set of information states exceeds the
storage capacity of personal computers or the optimal policy must be obtained in
a short period of time. So if the computational cost is prohibitive, the decision
maker may be better off with a policy that is not optimal. This is the main
objective of this work: to make possible the evaluation of asymmetric and complex
decision problems.

Here, we combine two different lines: use of qualitative information about 
the problem (constraints, due to asymmetries) and approximation. Both of these 
can be easily used thanks to the use of numerical trees to represent conditional 
probabilities and utilities. Constraints explicitly state asymmetries and reduce 
the number of scenarios to consider. Approximation simplifies the problem and 
may be the sole solution when exact evaluation is impossible, giving a policy to 
the decision maker even though it is not optimal. 

Other approximate methods to evaluate IDs exist. For example, Lauritzen and
Nilsson [12] introduce Limited Memory Influence Diagrams (LIMIDs) to describe
multi-stage decision problems where the no-forgetting assumption is relaxed.
This method can be seen as a form of approximate inference. Charnes and
Shenoy [6] use a sampling technique to solve complex IDs.

The remainder of the paper is organized in the following way: Sect. 2
introduces some concepts and notation about IDs and asymmetries, as well as
computational issues faced when solving complex decision problems; Sect. 3
presents key issues about numerical trees and how they are used to evaluate IDs;
Sect. 4 describes the algorithms used to put our ideas into practice and the way
they must be changed to do so; Sect. 5 includes the experimental results; and
finally Sect. 6 sets out our conclusions and lines for future work.



2 Influence Diagrams and Asymmetries 

IDs are directed acyclic graphs with three types of nodes: decision nodes
(mutually exclusive actions which the decision maker must choose from), chance
nodes (events which the decision maker cannot control), and utility nodes
(representing decision maker preferences). Links represent dependencies:
probabilistic for links into chance nodes, informational for links into decision
nodes (states for decision parents are known before the decision is taken), and
functional for links into value nodes. The semantics of IDs usually assume that
the decision maker remembers past observations and decisions (no-forgetting
assumption), although some authors (Lauritzen and Nilsson [12]) relax this
assumption.
Direct predecessors of chance or value nodes are called conditional
predecessors; direct predecessors of decision nodes are designated informational
predecessors. Decision maker preferences are expressed as utility functions,
indicating the local utility for the configurations of the variables in their
domain.

The set of chance nodes is denoted $V_C$, the set of decision nodes is denoted
$V_D$, and the set of utility nodes is denoted $V_U$. The set of all possible
combinations of values for the direct predecessors of decision node $D$ is called
the information set for $D$. The elements of this set are called information
states for $D$. The universe of the ID is
$V = V_C \cup V_D = \{X_1, \ldots, X_n\}$. Let us suppose that each variable $X_i$
takes values on a finite set $U_i$ containing $|U_i|$ elements. If $I$ is a set of
indices, we shall write $X_I$ for the set of variables $\{X_i | i \in I\}$,
defined on $U_I = \prod_{i \in I} U_i$. A potential $\phi$ defined on $U_I$ will be
a mapping $\phi : U_I \to \mathbb{R}$, where $\mathbb{R}$ is the set of real
numbers. Potentials are used to refer to both probability distributions and
utility functions. For probability distributions, a potential will be a mapping
$\phi : U_I \to [0, 1]$; for utility functions, it is a mapping
$\phi : U_I \to \mathbb{R}$. The sets of utility potentials and probability
potentials are denoted $\Psi$ and $\Phi$, respectively. A policy for an ID
prescribes an action (or a sequence of actions, if there are several decision
nodes) for each possible combination of outcomes of its informational
predecessors. An optimal policy is a policy which maximizes the decision maker's
expected value. This will be the objective for ID evaluation algorithms.

The drawback of using IDs to model asymmetric decision problems is well 
known. It is sometimes possible to identify the source of asymmetry and to point 
out this qualitative knowledge with relations between variables. In our solution, 
we try to keep qualitative and quantitative knowledge separate, merely because 
(as will be explained later) qualitative knowledge may affect several distributions, 
with some of them not being present in the model (i.e. distributions managed 
during the evaluation process and derived from initial ones) . On the other hand, 
we attempt to store both kinds of knowledge in similar structures, making their 
joint application easier. In order to represent the qualitative knowledge about a 
decision problem, we therefore propose that constraint rules be used. 

A constraint rule is an expression antecedent $\Rightarrow$ consequent. An
atomic sentence is a pair (variable, set of values):
$X_i \in \{x_1, \ldots, x_j\}$. Atomic sentences can be connected with logical
operators to form logical sentences. Valid logical operators are $\wedge$ (and),
$\vee$ (or), and $\neg$ (not). For constraint rules, both antecedents and
consequents are expressed using logical sentences.

By way of example, let us suppose that $X_1$, $X_2$, and $X_3$ take values
respectively on the sets $U_1 = \{x_1^1, x_1^2, x_1^3\}$,
$U_2 = \{x_2^1, x_2^2\}$, $U_3 = \{x_3^1, x_3^2\}$; then the constraint rule

$$X_1 \in \{x_1^1, x_1^2\} \wedge X_2 \in \{x_2^1\} \Rightarrow X_3 \in \{x_3^1\} \qquad (1)$$

states that if $X_1$ is equal to $x_1^1$ or $x_1^2$ and $X_2$ is equal to
$x_2^1$, then $X_3$ must be equal to $x_3^1$. Considering this constraint rule and
the conditional probability distribution $P(X_3|X_1, X_2)$, we can state that
$P(X_3 = x_3^2 | X_1 = x_1^1, X_2 = x_2^1) = 0$. An atomic sentence could have an
empty set of values for the consequent. For example, the constraint rule
$X_1 \in \{x_1^1\} \wedge X_2 \in \{x_2^2\} \Rightarrow X_3 \in \{\}$ means that
$X_1 = x_1^1, X_2 = x_2^2$ is an impossible scenario and should not be considered
for computations.

Fig. 1. Example of constraint between decision variables
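Constraint rules of this kind can be represented as pairs of predicates and used to filter out invalid scenarios. The sketch below is illustrative only: the helper names (`atom`, `conj`, `disj`, `neg`, `satisfies`) are hypothetical, states are encoded as small integers, and the two rules are example rules of the same shape as those in the text, with the concrete state indices chosen as an assumption.

```python
from itertools import product


def atom(variable, values):
    """Atomic sentence: the variable takes one of the given values."""
    allowed = frozenset(values)
    return lambda config: config[variable] in allowed


def conj(*sentences):
    return lambda config: all(s(config) for s in sentences)


def disj(*sentences):
    return lambda config: any(s(config) for s in sentences)


def neg(sentence):
    return lambda config: not sentence(config)


def satisfies(config, rules):
    """A configuration is valid unless some rule has a true antecedent
    and a false consequent."""
    return all(not antecedent(config) or consequent(config)
               for antecedent, consequent in rules)


# Hypothetical domains: X1 has three states, X2 and X3 have two.
domains = {"X1": (1, 2, 3), "X2": (1, 2), "X3": (1, 2)}
rules = [
    # X1 in {1, 2} and X2 in {1}  =>  X3 in {1}
    (conj(atom("X1", {1, 2}), atom("X2", {1})), atom("X3", {1})),
    # X1 in {1} and X2 in {2}  =>  X3 in {}  (impossible scenario)
    (conj(atom("X1", {1}), atom("X2", {2})), atom("X3", set())),
]

names = list(domains)
valid = [dict(zip(names, values))
         for values in product(*domains.values())
         if satisfies(dict(zip(names, values)), rules)]
```

A rule with an empty consequent removes every configuration that matches its antecedent, which is how an impossible scenario is expressed.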

Sometimes the qualitative knowledge is not directly linked to any distribution
of the model. For example, we could have the following situation: a constraint
links the values of $D_1$ and $D_2$, but there is no distribution where these
variables take part together in the model (see Figure 1). However, during the
evaluation process the value node will depend on $D_1$ and $D_2$ and this will be
the moment to activate the constraint.

We can therefore distinguish two general situations: qualitative knowledge re- 
lated to initial distributions, and qualitative knowledge related to distributions 
derived from the initial ones and used during the evaluation of the model. In the 
first case, there is some redundancy between qualitative and quantitative knowl- 
edge, but we consider it is very useful to use the rules even so. Firstly, constraint 
rules make the elicitation process easier (reducing the number of scenarios and 
therefore the number of parameters to assess); secondly, they help to make both 
qualitative and quantitative knowledge consistent; and thirdly, they clearly state 
invalid scenarios, making the contingent nature of the decision problem clear. 

3 Numerical Trees 

Probability trees [2,3,4,5,17] have previously been used to represent probability
distributions (probability potentials). These papers show how probability trees
may be used to carry out both exact and approximate calculations. The main
advantage of probability trees is that they allow a large potential to be
approximated by another of a smaller size, by collapsing several of its branches
into a single leaf which contains the average of the values stored in the removed
branches. In this paper, we will also use trees of numbers to represent utility
functions and constraint rules. The trees for utility functions will be called
utility trees and the trees for constraint rules will be called constraint trees.
We will refer to all of them as numerical trees.

A numerical tree $\mathcal{T}$ on the set of variables $X_I$ is a directed tree,
where each internal node is labeled with a variable (random variable or decision
node), and each leaf node is labeled with a number (a probability, a utility value
or a 0-1 value for constraints). Each internal node has an outgoing arc for each
state of the variable associated with that node. Outgoing arcs from a node $X_i$
are labeled with the name of the associated state ($x_i \in U_i$) of $X_i$. The
size of a tree, denoted $size(\mathcal{T})$, is defined as the number of leaves
of $\mathcal{T}$.

Fig. 2. a) A potential $\phi$ and a numerical tree representing it. b) Examples
of combination, marginalization and maximization in numerical trees

It can be said that a numerical tree $\mathcal{T}$ on variables $X_I$ represents
a potential $\phi : U_I \to \mathbb{R}$ if for each $x_I \in U_I$ the value
$\phi(x_I)$ is the number stored in the leaf node which is reached by starting at
the root node and selecting the child corresponding to coordinate $x_i$ for each
internal node labeled with $X_i$. The potential represented by tree $\mathcal{T}$
is denoted by $\phi_{\mathcal{T}}(X_I)$. Given a potential $\phi$, a numerical
tree for $\phi$ is denoted by $\mathcal{T}_\phi$. Figure 2.a shows an example of a
numerical tree for the potential $\phi(X_1, X_2, X_3)$. This is a probability tree
for the conditional distribution $P(X_1|X_2, X_3)$.

Three basic operations are necessary to evaluate IDs: combination (denoted 
with $\phi_1 \otimes \phi_2$), marginalization, and maximization (denoted with 
$\max_{X_i} \phi$). The three operations can be carried out directly on the 
numerical tree representation. In [4] and [17], the authors show in detail how to 
perform the first two operations and how to build a numerical tree from a 
potential. Maximization is used when a decision node is going to be removed. This 
operation is completely analogous to marginalization because it also deletes a 
variable from the numerical tree, but instead of adding up values, we take 
the maximum. Figure 2b shows examples of combination, marginalization 
and maximization in numerical trees. When approximate evaluation of IDs is 
used, a way of approximating numerical trees must be specified. In the 
following section, we briefly describe the approximation operation. 
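
The three operations can be illustrated with a small sketch. The code below is not the Elvira implementation; the class `Node` and the functions `value`, `size`, `restrict`, `merge` and `eliminate` are hypothetical names chosen for this illustration, and the example potentials are invented. It assumes a tree is either a number (a leaf) or an internal node with one child per state.

```python
import operator
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Node:
    """Internal node: a variable name and one child per state of that variable."""
    var: str
    children: Dict[str, "Tree"]


Tree = Union[float, Node]  # a leaf is just a number


def value(tree, config):
    """Follow the branch selected by `config` down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.children[config[tree.var]]
    return tree


def size(tree):
    """Number of leaves of the tree."""
    if not isinstance(tree, Node):
        return 1
    return sum(size(child) for child in tree.children.values())


def restrict(tree, var, state):
    """Fix var = state and drop var from the tree."""
    if not isinstance(tree, Node):
        return tree
    if tree.var == var:
        return restrict(tree.children[state], var, state)
    return Node(tree.var, {s: restrict(c, var, state)
                           for s, c in tree.children.items()})


def merge(t1, t2, op):
    """Pointwise combination of two trees with a binary operator."""
    if isinstance(t1, Node):
        return Node(t1.var, {s: merge(c, restrict(t2, t1.var, s), op)
                             for s, c in t1.children.items()})
    if isinstance(t2, Node):
        return Node(t2.var, {s: merge(t1, c, op)
                             for s, c in t2.children.items()})
    return op(t1, t2)


def eliminate(tree, var, states, op):
    """Delete var by folding op over its states (add: marginalize, max: maximize)."""
    result = restrict(tree, var, states[0])
    for s in states[1:]:
        result = merge(result, restrict(tree, var, s), op)
    return result


# Invented example: a probability tree for P(X1 | X2) and a utility tree on X1.
phi = Node("X2", {"a": Node("X1", {"t": 0.3, "f": 0.7}),
                  "b": Node("X1", {"t": 0.5, "f": 0.5})})
psi = Node("X1", {"t": 10.0, "f": 0.0})

product = merge(phi, psi, operator.mul)                    # combination
summed = eliminate(product, "X1", ["t", "f"], operator.add)  # marginalization
best = eliminate(summed, "X2", ["a", "b"], max)            # maximization
```

Marginalization and maximization share the same code path and differ only in the operator folded over the states, which mirrors the remark that the two operations are analogous.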

3.1 Approximating Numerical Trees 

In general, the problem consists of approximating a numerical tree 
$\mathcal{T}_\phi$ by another tree $\mathcal{T}_{\phi'}$ of a smaller size. For 
probability trees, we will measure the distance between the two potentials by the 
Kullback-Leibler cross entropy [11] between $\phi$ and $\phi'$. If we denote the 
probability distributions proportional to $\phi$ and $\phi'$ by $p_\phi$ and 
$p_{\phi'}$, respectively, then the Kullback-Leibler cross entropy is calculated 
with: 

$$D(\mathcal{T}_\phi, \mathcal{T}_{\phi'}) = \sum_{x_I \in U_I} p_\phi(x_I) \log \frac{p_\phi(x_I)}{p_{\phi'}(x_I)} \qquad (2)$$

The Kullback-Leibler cross entropy measures how far one probability distribution 
is from another (it is not symmetric, so it is not a metric in the strict sense). 
In this paper, we only propose to approximate probability trees, although a way to 
approximate utility trees should be considered when these trees are large. This 
aspect is a task for future work. 

One way of obtaining $\mathcal{T}'$ is to prune $\mathcal{T}$. In order to prune 
a numerical tree, we select a terminal node (a node such that all its children are 
leaves) and we replace it with the average of its child nodes. The pruning 
operation can be applied again to the pruned tree $\mathcal{T}'$ until we get a 
tree of an acceptable size, or for as long as the error (distance) with respect to 
the original tree $\mathcal{T}_\phi$ stays below a given threshold. For probability 
trees, it can be proved that pruning with the average minimizes the 
Kullback-Leibler divergence between the original tree and the pruned tree [3,17]. 
These authors have used approximate probability trees in order to propagate in 
Bayesian networks. 

The main issue when pruning a numerical tree $\mathcal{T}$ is how to select the 
terminal nodes to prune [3,17]. We consider a threshold $\Delta_P > 0$ and we then 
approximate the children of a terminal node by their average if the 
Kullback-Leibler divergence between the original tree and the approximate one is 
less than $\Delta_P$. 
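
The pruning criterion can be sketched as follows. This is only an illustration under simplifying assumptions: the potential is stored as a flat list of leaf values, each terminal node is given as a `(start, stop)` slice of that list, and the names `kl_divergence`, `prune_terminal` and `prune` are hypothetical. The test uses `<=` rather than a strict inequality so that a threshold of 0 collapses identical leaves, which is how the exact (threshold 0) pruning step of the evaluation schema is read here.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence between the distributions proportional to p and q."""
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq))
               for a, b in zip(p, q) if a > 0)


def prune_terminal(leaves, start, stop):
    """Replace the children leaves[start:stop] of one terminal node by their average."""
    avg = sum(leaves[start:stop]) / (stop - start)
    return leaves[:start] + [avg] * (stop - start) + leaves[stop:]


def prune(leaves, groups, threshold):
    """Collapse every terminal node whose pruning keeps the divergence within threshold."""
    current = list(leaves)
    for start, stop in groups:
        candidate = prune_terminal(current, start, stop)
        # Distance is always measured against the original potential.
        if kl_divergence(leaves, candidate) <= threshold:
            current = candidate
    return current


# Invented example: two terminal nodes with two children each.
leaves = [0.1, 0.1, 0.2, 0.6]
groups = [(0, 2), (2, 4)]
exact = prune(leaves, groups, 0.0)    # only identical children are merged
approx = prune(leaves, groups, 0.2)   # the second node is averaged as well
```

In a real tree a collapsed node becomes a single leaf; the expanded list is kept here only so that the divergence can be computed entry by entry.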

We have also applied the sort operation [3,17] to numerical trees. This oper- 
ation tries to restructure the nodes of the numerical tree in such a way that the 
more informative variables are in the upper levels of the tree. In this way, if the 
tree is pruned, then only the less informative variables will be eliminated. The 
algorithm for this operation is very similar to the algorithm to build a numerical 
tree. Both operations construct the tree incrementally, by including the most in- 
formative variable (the variable minimizing the distance to the exact potential) 
at every step in the new tree. 



3.2 Constraint Trees 

From a constraint rule we can obtain a numerical tree (a constraint tree). Leaf 
nodes in constraint trees contain the values 0 or 1. If $\mathcal{T}^c$ is a 
constraint tree for a constraint rule with variables $X_J$, then a value of 0 in a 
leaf node means that the configuration of its ancestor variables corresponds to an 
impossible scenario in the ID. A value equal to 1 means that, taking into account 
only this constraint tree, the configuration is possible. A simple procedure can 
be implemented in order to build a constraint tree from a constraint rule, but it 
is not included here for brevity. Figure 3a shows an example of a constraint rule 
and its associated constraint tree. 

Fig. 3. a) A constraint rule and its associated constraint tree. b) Maximization 
on $X_2$ in the tree of Fig. 3a 

Constraint rules and trees are useful when evaluating the ID in order to reduce 
the size of the potentials (probability trees and utility trees). This reduction 
in turn lowers the complexity of the operations (combination and marginalization). 
To decide whether a constraint rule for the variables $X_J$ can be used with a 
potential $\phi$ for the variables $X_I$, we have to check the applicability of the 
constraint rule. The applicability of the complete constraint rule depends on 
the logical operators that connect the atomic sentences of the rule. We use the 
following definitions. An atomic sentence in a constraint rule for $X_J$ is 
applicable to a potential $\phi$ for $X_I$ if the variable $X_t$ of the atomic 
sentence is in $X_J \cap X_I$. The negation of a sentence is applicable if and 
only if the sentence itself is applicable. A conjunction is applicable if and only 
if the two conjuncts are applicable. A disjunction is applicable if and only if at 
least one of the disjuncts is applicable. With these definitions, the constraint 
rule is applicable if and only if both the antecedent and the consequent are 
applicable. 
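
The applicability definitions translate directly into a recursive check. The sketch below assumes a hypothetical encoding of sentences as nested tuples (`("atom", variable)`, `("not", s)`, `("and", s1, s2)`, `("or", s1, s2)`); neither this encoding nor the function names come from the paper.

```python
def applicable(sentence, rule_vars, potential_vars):
    """Apply the applicability definitions recursively to one logical sentence."""
    kind = sentence[0]
    if kind == "atom":
        # The variable of the atomic sentence must be shared by rule and potential.
        return sentence[1] in (rule_vars & potential_vars)
    if kind == "not":
        return applicable(sentence[1], rule_vars, potential_vars)
    if kind == "and":
        return (applicable(sentence[1], rule_vars, potential_vars)
                and applicable(sentence[2], rule_vars, potential_vars))
    if kind == "or":
        return (applicable(sentence[1], rule_vars, potential_vars)
                or applicable(sentence[2], rule_vars, potential_vars))
    raise ValueError(f"unknown sentence type: {kind!r}")


def rule_applicable(antecedent, consequent, rule_vars, potential_vars):
    """A rule is applicable iff both its antecedent and its consequent are."""
    return (applicable(antecedent, rule_vars, potential_vars)
            and applicable(consequent, rule_vars, potential_vars))


# Invented rule: (X1 or X2) implies X3, over the variables X1, X2, X3.
rule_vars = {"X1", "X2", "X3"}
antecedent = ("or", ("atom", "X1"), ("atom", "X2"))
consequent = ("atom", "X3")

on_x1_x3 = rule_applicable(antecedent, consequent, rule_vars, {"X1", "X3"})
on_x1_x2 = rule_applicable(antecedent, consequent, rule_vars, {"X1", "X2"})
```

With this invented rule, a potential on X1 and X3 can use the rule because the disjunction needs only one applicable disjunct, whereas a potential on X1 and X2 cannot because the consequent mentions X3.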

Suppose $\mathcal{T}^c$ is the constraint tree associated with a constraint rule, 
and $\mathcal{T}_\phi$ the numerical tree of a potential $\phi$. If the constraint 
rule is applicable to $\phi$, then we can combine $\mathcal{T}^c$ and 
$\mathcal{T}_\phi$ in order to obtain a smaller numerical tree. Before combining 
$\mathcal{T}^c$ and $\mathcal{T}_\phi$, we must remove from $\mathcal{T}^c$ the 
variables not included in $\phi$. If $X_i$ is one of these variables, then $X_i$ 
is removed from $\mathcal{T}^c$ by maximization in $X_i$: 
$\max_{X_i} \mathcal{T}^c$. By way of example, let us suppose that we have a 
potential (utility) $\phi$ for variables $X_1$ and $X_3$, represented with the 
utility tree in Fig. 4a, and the constraint rule in Fig. 3a. It is possible to 
prove that this constraint rule is applicable to $\phi$. As the variable $X_2$ is 
in $\mathcal{T}^c$ but not in $\phi$, we must remove $X_2$ from $\mathcal{T}^c$ by 
calculating $\max_{X_2} \mathcal{T}^c$ (see Fig. 3b). This new constraint tree can 
now be combined with potential $\phi$. Figure 4b shows the resulting utility tree. 

The combination of numerical trees and constraint trees will be used in the 
evaluation of the ID, for the initial numerical trees (conditional distributions 
and utility functions), and also for the new numerical trees obtained in the 
evaluation process, after eliminating a decision or chance variable. 

Fig. 4. a) A utility tree for $X_1$ and $X_3$. b) The utility tree after applying 
the constraint rule in Fig. 3 



4 Evaluating Influence Diagrams 

In order to show how constraints and approximation can be applied to evaluate 
IDs, we have decided to work with two algorithms: arc reversal [18] and variable 
elimination [10]. This selection is based on the different nature of the operations 
they require to compute the optimal policy. 

Arc reversal (AR) uses the structure of the ID to decide the next operation 
to be done (chance node removal, decision node removal or arc reversal). Each 
operation modifies the quantitative knowledge and the structure of the ID. These 
operations are combined sequentially until all the chance and decision nodes have 
been removed. The detailed explanation for these operations can be found in [18]. 
The general schema for this algorithm is shown below: an initialization phase 
(1) is added, and also additional steps for applying constraints and pruning 
after computations (3.b and 3.c). 

1. Initialization phase 

a) Build initial trees: $\forall \phi \in \Phi \cup \Psi$ obtain $\mathcal{T}_\phi$ 

b) Apply constraints: $\forall \mathcal{T}^c$, if $\mathcal{T}^c$ is applicable to 
$\mathcal{T}_\phi$ (and has not yet been applied) then 
$\mathcal{T}_\phi = \mathcal{T}_\phi \otimes \mathcal{T}^c$; otherwise 
$\mathcal{T}_\phi$ is left unchanged 

c) Sort variables in utility and probability trees 

d) Prune all the trees with $\Delta = 0$ as the threshold (without approximation): 
$\forall \phi \in \Phi \cup \Psi$, prune($\mathcal{T}_\phi$, 0) 

2. Set a new $\Delta$ for approximation operations 

3. While there are conditional predecessors for utility nodes 

a) Decide next operation to do and compute 

b) Apply constraints to modified potentials, as explained in the initialization 
phase (1.b). If a chance or decision node was removed, sort variables 
in utility and probability trees. If arc reversal was the operation selected, 
normalize the modified distributions 

c) Prune modified trees with $\Delta$ as threshold: prune($\mathcal{T}_\phi$, $\Delta$) 

Variable Elimination (VE) uses the temporal order between the decision 
nodes to partition the whole set of nodes according to when they are observed. 
Let us consider an ID with $n$ decision nodes. $I_0$ is the set of chance nodes 
observed prior to taking any decision. $I_i$ is the set of chance nodes observed 
after $D_i$ was taken and before taking $D_{i+1}$. That is to say, nodes are 
partially ordered: $I_0 \prec D_1 \prec I_1 \prec \ldots \prec D_n \prec I_n$. If 
$A \in V_C \prec D \in V_D$ then there is a directed path from $A$ to $D$. Once 
this order has been established, the algorithm eliminates all the variables one by 
one, with two operations: sum-marginalization and max-marginalization. The 
detailed explanation for this algorithm can be found in [10]. The final schema is 
presented below. 

1. Initialization phase, as explained for AR algorithm (phase 1) 

2. Set a new $\Delta$ for approximation operations 

3. While there are nodes to remove, from $I_n$ to $I_0$ 

a) Decide next node to remove ($N$) and combine all potentials (trees) related 
to it. New trees are obtained as a result: $\mathcal{T}_\phi$ and $\mathcal{T}_\psi$ 

b) Apply constraints to modified potentials, as explained in the initialization 
phase (1.b). Sort variables in $\mathcal{T}_\phi$ and $\mathcal{T}_\psi$. If a 
constraint is applied to $\mathcal{T}_\phi$, normalize 

c) Prune $\mathcal{T}_\phi$ and $\mathcal{T}_\psi$ with $\Delta$ as threshold: 
prune($\mathcal{T}_\phi$, $\Delta$), prune($\mathcal{T}_\psi$, $\Delta$) 

It should be noted that the operations to apply constraints and to prune 
the trees (exactly or with approximation) do not change the global structure 
of the algorithms; they merely add pre-processing at their beginning and post- 
processing after operations which change the potentials. The use of these ideas 
therefore relies on the use of numerical trees to encode the quantitative knowl- 
edge, but not on the algorithm itself. 

5 Experimental Results 

For testing purposes, we have used an ID with 31 chance nodes, 2 decision nodes 
and 1 value node. The number of outcomes per variable ranges between 2 and 
9. The nodes are connected by 56 links. The total sum of the sizes of the 
ID potentials is 951. The temporal order between decisions imposes a constraint 
between their values. The ID models a short version of a medical problem where the 
decisions are related to different treatments. So, if the Treatment1 decision 
variable takes the values "do not treat" or "observe and dismiss", then the only 
valid value for the Treatment2 decision is "do not treat". 

The tests applied consist of (1) exact evaluation with each algorithm, without 
trees (AR and VE) and with trees: without constraints (ARWT, VEWT) 
and with constraints (ARWTC, VEWTC); (2) for both algorithms, evaluations 
with approximations for probability potentials, without constraints (AARWT, 
AVEWT) and with constraints (AARWTC, AVEWTC), varying the threshold 
($\Delta_P$) for pruning from 0.001 to 0.1, with 10 steps between these two values, 
applying the criteria explained in Section 3.1. 

The graphics included in Fig. 5 show: (a) storage requirements for AR 
algorithms, with tables to store quantitative information, using trees and applying 
constraints (AR, ARWT, ARWTC). No approximation has been performed. The 
horizontal axis indicates the number of operations required to complete the eval- 
uation; (b) as (a), for VE algorithms. The curve for ARWTC (the most econom- 
ical storage size) is included in order to compare the performance of both fami- 
lies of algorithms; (c) time of computation for the six tests; (d) maximum size 
reached during approximate evaluation, for AARWT, AARWTC, AVEWT, and 
AVEWTC. The horizontal axis represents the thresholds used for the approxima- 
tions; (e) as (d), with time of computation; (f) errors regarding expected utility 
values computed without approximation, for AARWT, AARWTC, AVEWT, and 
AVEWTC. 

Experiments show that VE requires less storage space than AR in all 
the versions of these algorithms (compare the exact methods: VE-AR, VEWT-ARWT, 
VEWTC-ARWTC in Figs. 5a and 5b; and the approximate ones: 
AVEWT-AARWT, AVEWTC-AARWTC in Fig. 5d). The use of numerical 
trees reduces storage space in both algorithms (see Figs. 5c, 5b, 5d). Constraints 
reduce the required space even further (compare VEWT-VEWTC and ARWT-ARWTC 
in Fig. 5d). Storage requirements can be controlled using a parameter 
$\Delta$ while still obtaining good approximations to the expected utility (see 
Figs. 5d and 5e). VE requires less time than AR in all versions (see 
Fig. 5c for exact evaluation and Fig. 5e for approximate evaluation). Approximate 
VE yields smaller errors in the expected utility than AR (see Fig. 5f). These 
algorithms were implemented in Java with the Elvira tools 
(http://leo.ugr.es/~elvira). The tests were run on a Pentium 4 computer (2GHz) 
with the Linux Red Hat 7.3 operating system. 

6 Conclusions and Future Work 

In this paper we have presented some ideas for evaluating complex influence 
diagrams. These ideas include the use of numerical trees to represent and com- 
pute with probability distributions and utility functions. We have also specified 
asymmetries by means of constraint rules and constraint trees, in order to reduce 
the size of the scenarios to be considered. The paper also shows how to use a 
deterministic approximate method for very complex problems. This allows us to 
compute a policy although it will not be optimal. 

We conclude that the use of numerical trees for evaluating IDs requires less 
storage space than traditional methods. If we also use constraint trees, we further 
reduce the space required to compute the models. When the problem is too 
complex, we can use pruning operations, obtaining good approximate results. 

In future lines of work, we shall study additional pruning methods for utility 
trees, e.g. pruning terminal nodes with standard deviation (regarding the maxi- 
mum possible) below a given threshold. We are now considering the application 
of approximate methods in junction trees, for penniless [4] and lazy penniless [5] 
algorithms. 

Fig. 5. Experimental results 

Abstract. This paper presents an architecture for exact evaluation of 
influence diagrams containing a mixture of continuous and discrete vari- 
ables. The proposed architecture is the first architecture for efficient ex- 
act solution of linear-quadratic conditional Gaussian influence diagrams 
with an additively decomposing utility function. The solution method 
as presented in this paper is based on the idea of lazy evaluation. The 
computational aspects of the architecture are illustrated by example. 



1 Introduction 

The framework of influence diagrams [1] is an effective modeling framework for 
analysis of Bayesian decision making under uncertainty. The influence diagram 
is a natural representation for capturing the semantics of decision making with a 
minimum of clutter and confusion for the decision maker. An influence diagram is 
essentially a Bayesian network augmented with decision variables, utility nodes, 
and precedence constraints. Solving a Bayesian decision problem amounts to 
determining an optimal strategy maximizing the expected utility for the decision 
maker. Determining an optimal strategy is, unfortunately, a computationally 
intensive task to solve. 

Different architectures for solving discrete Bayesian decision problems have 
been proposed [12,15,2,7,5,9]. Many real-life decision problems do, however, 
involve reasoning about uncertain entities and decisions, which take on values 
in continuous ranges. Few architectures for solving continuous Bayesian decision 
problems exist [3, 14]. Even fewer architectures deal with the mixed case. In [11] 
an architecture where arbitrary continuous distributions are approximated using 
(artificial) mixtures of Gaussians is described. 

We present a computationally efficient architecture for representing and solv- 
ing mixed Bayesian decision problems. The architecture is for simplicity of ex- 
position based on formulating Bayesian decision problems as mixed influence 
diagrams and the solution method is an extension of Lazy propagation [7]. The 
proposed architecture can be considered as an extension of conditional-Gaussian 
Bayesian networks and discrete influence diagrams to the case of linear-quadratic 
conditional Gaussian influence diagrams with decomposing utility functions. 

Even though we present the solution method of the architecture as a message 
passing algorithm based on Lazy propagation, the results are applicable to most 
other algorithms (including variable elimination). A detailed treatment of the 
general applicability of the results is, however, outside the scope of this paper. 



2 Mixed Influence Diagrams 

A mixed influence diagram $N$ is defined as a triple $N = (G, \Phi, \Psi)$ where 
$G = (V, E)$ is an acyclic, directed graph (DAG) with chance, decision, and utility 
(value) nodes, $\Phi$ is a set of probability functions, and $\Psi$ is a set of 
utility functions. 

The nodes $V$ of $G$ are partitioned into the set of continuous nodes $\Gamma$, 
the set of discrete nodes $\Delta$, and the set of utility nodes $\Upsilon$, i.e. 
$V = \Gamma \cup \Delta \cup \Upsilon$. The set of decision nodes is denoted as 
$\mathcal{D}$ and the set of chance nodes as $\mathcal{C}$. We define subsets 
$\Delta_C = \{X \in \Delta : X \in \mathcal{C}\}$ and 
$\Delta_D = \{Y \in \Delta : Y \in \mathcal{D}\}$. Subsets $\Gamma_C$ and 
$\Gamma_D$ are defined similarly. We will refer to nodes and variables 
interchangeably. 

An arc into a node $X \in \mathcal{C}$ denotes a possible probabilistic dependence 
relation whereas an arc from a node $Y$ into a node $D \in \mathcal{D}$ indicates 
that $Y$ is observed prior to decision $D$. Arcs into decision nodes are referred 
to as information arcs. The information arcs of $N$ induce a partial precedence 
order $\prec$ on $\Gamma \cup \Delta$ of $N$ s.t. 
$I_0 \prec D_1 \prec I_1 \prec \cdots \prec D_n \prec I_n$ where $I_i$ is the set 
of chance variables observed after $D_i$ but before $D_{i+1}$. We assume a total 
order on the decision variables and a non-forgetting decision maker. 

We consider the case of linear-quadratic conditional Gaussian influence diagrams. 
This implies $pa(X) \subseteq \Delta$ whenever $X \in \Delta$ and 
$pa(X) \subseteq \Delta \cup \Gamma$ whenever $X \in \Gamma \cup \Upsilon$. Each 
chance node $X \in \Delta$ has a conditional probability distribution 
$P(X | pa(X))$ whereas each chance node $X \in \Gamma$ is linear Gaussian 
conditional on $pa(X) \cap \Delta$. Each value node $U \in \Upsilon$ has a local 
utility function which assigns a value to each configuration of its parents in the 
discrete case and is a conditional quadratic function of its continuous parents in 
the mixed case. The set of local utility functions represents an additively 
decomposing utility function. 

Fig. 1 shows an example of a mixed influence diagram N where the continuous 
nodes are indicated using a wider border. 




Fig. 1: A mixed influence diagram N 



To solve a mixed influence diagram $N$ is to determine an optimal strategy 
$\hat{\Delta} = \{\delta_1, \ldots, \delta_{|\mathcal{D}|}\}$ consisting of a policy 
$\delta$ for each decision $D \in \mathcal{D}$ and to compute the maximum expected 
utility of adhering to $\hat{\Delta}$. 

The computations involved in solving a mixed influence diagram N can be 
organized in a strong junction tree representation T of N. We abstain here from 
describing all details of the compilation process. Instead we refer the reader 
to [2] for details on the junction tree compilation process. The construction of T 
consists of four main steps: 

Minimalization where information arcs from non-requisite parents of decision 
nodes and barren variables are removed. 

Moralization where all pairs of parents with a common child are married by 
inserting an undirected edge between them. The graph is made undirected 
and all utility nodes are removed to obtain $G^M$. 

Triangulation where $G^M$ is triangulated to obtain $G^T$ with an elimination 
order $\sigma$ such that $\sigma(\Gamma) < \sigma(\Delta)$ and 
$\sigma(I_i) > \sigma(D_{i+1}) > \sigma(I_{i+1})$ for all $i \in [0, n-1]$. That 
is, $G^T$ is a strong triangulation of $G^M$. 

Junction tree construction where the cliques $\mathcal{C}$ of $G^T$ are organized 
as a strong junction tree $T = (\mathcal{C}, \mathcal{S})$ with strong root $R$. 

Fig. 2 shows a mixed influence diagram representing a simple game. The first 
decision is to either accept an immediate award or to play a game where you 
will receive a payoff determined by how good you are at guessing the height of a 
person based on knowledge about the sex of the person. The payoff is a constant 
(higher than the award) minus the squared distance of your guess from the true 
height of the person, i.e. minus (height minus guess) squared. 




Fig. 2: A mixed influence diagram for a simple game 



2.1 Conditional Linear Gaussians 

Let $X$ be an $n$-dimensional continuous variable with discrete parents $I$ and 
continuous parents $Z$, then $X$ has a conditional linear Gaussian distribution if: 

$$\mathcal{L}(X | Z = z, I = i) = \mathcal{N}(A(i) + B(i)z, C(i)). \qquad (1)$$

The mean vector of $X$ depends linearly on the states of the continuous parent 
variables $Z$, while the covariance matrix is independent of $Z$. In equation 1, 
$X$ is an $n \times 1$-dimensional vector, $A(i)$ is a table of 
$n \times 1$-dimensional vectors, $B(i)$ is a table of $n \times |Z|$-dimensional 
matrices, and $C(i)$ is a table of $n \times n$-dimensional positive semi-definite 
matrices. 

We will use a notation similar to that of [4]. A cg-potential 
$p(i) \cdot \mathcal{L}(X | Z = z, I = i)$ is represented as $p(i)$ and 
$\lambda = [A, B, C](\{X\} | \{Z, I\})$ where $(\{X\}, \{Z, I\})$ is a partitioning 
of the domain variables $dom(\lambda)$ of $\lambda$ into head $H(\lambda) = \{X\}$ 
and tail $T(\lambda) = \{Z, I\}$, respectively. Notice that we have separated the 
discrete part $p(i)$ of the cg-potential from the continuous part $\lambda$. We 
define $\mathcal{P} \subseteq \Phi$ as 
$\mathcal{P} = \{p(X | pa(X)) : X \in \Delta_C\}$ and $\Lambda \subseteq \Phi$ as 
$\Lambda = \{\mathcal{L}(X | pa(X)) : X \in \Gamma_C\}$. 

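Consider a cg-potential $\lambda(X_1, X_2 | Z) = [A, B, C]$ and let 

$$A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}, \quad B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}, \quad \text{and} \quad C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$$

be a partitioning of $A$, $B$, and $C$ relative to $X_1$ and $X_2$, respectively. 
The strong marginal of $\lambda$ w.r.t. $X_1$ is 
$\lambda(X_1 | Z) = [A_1, B_1, C_{11}]$. Using $:$ to indicate that the matrix is 
extended with an additional column for each new tail variable, the complement of 
$\lambda$ w.r.t. $X_1$ is: 

$$\lambda(X_2 | X_1, Z) = [A_2 - C_{21} C_{11}^{-1} A_1, \; [C_{21} C_{11}^{-1} : B_2 - C_{21} C_{11}^{-1} B_1], \; C_{22} - C_{21} C_{11}^{-1} C_{12}].$$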

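Let $X_1 | Z = z \sim \mathcal{N}(A_1 + B_1 z, C_1)$ and 
$X_2 | X_1 = x_1, Z = z \sim \mathcal{N}(A_2 + B_{21} x_1 + B_{22} z, C_2)$ have 
cg-potentials $\lambda_1$ and $\lambda_2$, respectively. The combination 
$\lambda(X_1, X_2) = \lambda_1 \otimes \lambda_2 = [U, V, W]$ is: 

$$U = \begin{pmatrix} A_1 \\ A_2 + B_{21} A_1 \end{pmatrix}, \quad V = \begin{pmatrix} B_1 \\ B_{22} + B_{21} B_1 \end{pmatrix}, \quad W = \begin{pmatrix} C_1 & C_1 B_{21}^T \\ B_{21} C_1 & C_2 + B_{21} C_1 B_{21}^T \end{pmatrix}.$$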

The combination $\lambda_1 \otimes \lambda_2$ is only defined when 
$H(\lambda_2) \cap dom(\lambda_1) = \emptyset$ or 
$H(\lambda_1) \cap dom(\lambda_2) = \emptyset$. Recursive combination 
$\lambda_1 \otimes \lambda_2$ is used to combine $\lambda_1$ and $\lambda_2$ when 
direct combination is not possible. The potentials $\lambda_1$ and $\lambda_2$ are 
decomposed recursively until direct combination can be applied, see [4] for 
details. For notational convenience a combination 
$\lambda_1 \otimes \cdots \otimes \lambda_n$ is written as 
$\bigotimes_{i=1}^{n} \lambda_i$. 

We assume the tail of all cg-potentials to be minimal [4]. 
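
As a sanity check on the operations of this section, the sketch below works through the scalar case (one-dimensional $X_1$, $X_2$ and $Z$), where every matrix reduces to a number. The function names and the tuple layout are hypothetical; the formulas are the standard ones for combining and conditioning linear Gaussians, written to match the notation above.

```python
import math


def combine(l1, l2):
    """Combine X1 | z ~ N(A1 + B1 z, C1) with X2 | x1, z ~ N(A2 + B21 x1 + B22 z, C2)."""
    a1, b1, c1 = l1
    a2, b21, b22, c2 = l2
    u = (a1, a2 + b21 * a1)                       # joint mean offsets
    v = (b1, b22 + b21 * b1)                      # joint regression on z
    w = ((c1, c1 * b21),
         (b21 * c1, c2 + b21 * c1 * b21))         # joint covariance
    return u, v, w


def strong_marginal(joint):
    """Strong marginal of the joint with respect to X1: [A1, B1, C11]."""
    u, v, w = joint
    return (u[0], v[0], w[0][0])


def complement(joint):
    """Complement with respect to X1, i.e. the distribution of X2 given x1 and z."""
    u, v, w = joint
    gain = w[1][0] / w[0][0]                      # C21 * C11^{-1}
    return (u[1] - gain * u[0],                   # constant term
            gain,                                 # coefficient of x1
            v[1] - gain * v[0],                   # coefficient of z
            w[1][1] - gain * w[0][1])             # conditional variance


# Invented parameters.
lam1 = (1.0, 0.5, 2.0)            # A1, B1, C1
lam2 = (3.0, -1.0, 0.25, 0.5)     # A2, B21, B22, C2

joint = combine(lam1, lam2)
recovered1 = strong_marginal(joint)
recovered2 = complement(joint)
```

Taking the strong marginal and the complement of the combined potential gives back the two original potentials, which is the relationship the division in Section 3 relies on.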



2.2 The Quadratic Utility Function 

In the linear-quadratic conditional Gaussian influence diagram, the utility 
function $U(\Gamma, \Delta)$ is a second-order polynomial in $\Gamma$ conditional 
on $\Delta$. Thus, the utility function has the form 
$U(X = x, I = i) = x^T Q(i) x + R(i) x + S(i)$, where $X$ is a $|X| \times 1$ 
vector of continuous variables, $I \subseteq \Delta$, $Q(i)$ is a table of 
$|X| \times |X|$ symmetric negative semi-definite matrices, $R(i)$ is a table of 
$1 \times |X|$ vectors, and $S(i)$ is a table of constants. 

The utility function $U(X, I)$ may decompose into a set of simpler terms 
$\Psi = \{\psi_1, \ldots, \psi_m\}$ s.t. 
$U(X, I) = \sum_{j=1}^{m} \psi_j(X_j, I_j)$ where $X_j \subseteq X$, 
$I_j \subseteq I$, and each $\psi_j(X_j, I_j)$ has the form 
$x_j^T Q(i_j) x_j + R(i_j) x_j + S(i_j)$. Each term $\psi(X, I)$ is represented as 
$[Q, R, S](\{X, I\})$ where $dom(\psi) = X \cup I$ with $X \subseteq \Gamma$ and 
$I \subseteq \Delta$. Notice that special care should be taken in the specification 
of $Q = [q_{ij}]$ since $q_{ij} = \frac{1}{2} q_{x_i, x_j}$ when $i \neq j$ and 
$x_i q_{x_i, x_j} x_j$ is a term of $U$. 

Let $\psi_1$ and $\psi_2$ be two utility functions such that 
$\psi_1 = [Q_1, R_1, S_1](\{X, I\})$ and $\psi_2 = [Q_2, R_2, S_2](\{X, I\})$ after 
proper domain extensions. The combination $\psi_1 + \psi_2$ is a utility function 
$[Q_1 + Q_2, R_1 + R_2, S_1 + S_2](\{X, I\})$. 

Recall that a polynomial $Qx^2 + Rx + S$ with $Q < 0$ takes on its maximum 
value at its vertex $v = (x, y)$ where $x = -\frac{R}{2Q}$ and 
$y = -\frac{R^2}{4Q} + S$. 
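
The vertex formula follows from completing the square; the short derivation below is included only as a reminder and uses no assumptions beyond $Q < 0$.

```latex
\begin{aligned}
Qx^2 + Rx + S
  &= Q\left(x^2 + \frac{R}{Q}x\right) + S \\
  &= Q\left(x + \frac{R}{2Q}\right)^2 - \frac{R^2}{4Q} + S .
\end{aligned}
```

Since $Q < 0$, the squared term is never positive and vanishes exactly at $x = -\frac{R}{2Q}$, where the polynomial takes the value $-\frac{R^2}{4Q} + S$.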

3 Solving Mixed Influence Diagrams 



The maximum expected utility of a mixed influence diagram $N$ can be calculated 
by eliminating the variables of $N$ in reverse order of the precedence ordering 
$\prec$ under the constraint that $\sigma(\Gamma) < \sigma(\Delta)$ where $\sigma$ 
is the elimination order and the first variable to eliminate $Y$ has 
$\sigma(Y) = 1$. From $\Phi = \mathcal{P} \cup \Lambda$ and $\Psi$, the maximum 
expected utility MEU is computed as: 

$$MEU = \mathop{M}_{X \in V} \prod_{\phi \in \Phi} \phi \sum_{\psi \in \Psi} \psi \qquad (2)$$

where $M$ is a further generalization of the generalized marginalization operator 
introduced by [2]. Here the operator is generalized such that a continuous 
chance variable $X$ is eliminated by integration, i.e. 
$M_X \phi = \int_x \phi \, dx$, whereas discrete chance variables are eliminated 
by summation and decision variables are eliminated by maximization as usual. The 
operator is defined precisely below. 

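Following the approach of [7], we assume that the first variable to eliminate 
according to the strong elimination order $\sigma$ is $Y$, i.e. $\sigma(Y) = 1$. 
Let $\Psi_Y$ be the subset of $\Psi$ including $Y$ in the domain, i.e. 
$\Psi_Y = \{\psi \in \Psi : Y \in dom(\psi)\}$ and let $\Phi_Y$ be the subset of 
$\Phi$ including $Y$ in the domain, i.e. 
$\Phi_Y = \{\phi \in \Phi : Y \in dom(\phi)\}$. Define $\phi_Y$ and $\psi_Y$ as 
follows: 

$$\phi_Y = \mathop{M}_{Y} \prod_{\phi \in \Phi_Y} \phi \qquad (3)$$

$$\psi_Y = \frac{\mathop{M}_{Y} \prod_{\phi \in \Phi_Y} \phi \sum_{\psi \in \Psi_Y} \psi}{\phi_Y} \qquad (4)$$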

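With these definitions, equation 2 is rewritten as: 

$$\begin{aligned}
MEU &= \mathop{M}_{X \in V} \prod_{\phi \in \Phi} \phi \sum_{\psi \in \Psi} \psi \\
&= \mathop{M}_{X \in V} \Big[ \prod_{\phi \in \Phi \setminus \Phi_Y} \phi \prod_{\phi' \in \Phi_Y} \phi' \Big( \sum_{\psi \in \Psi \setminus \Psi_Y} \psi + \sum_{\psi' \in \Psi_Y} \psi' \Big) \Big] \\
&= \mathop{M}_{X \in V \setminus \{Y\}} \Big[ \prod_{\phi \in \Phi \setminus \Phi_Y} \phi \; \mathop{M}_{Y} \prod_{\phi' \in \Phi_Y} \phi' \Big( \sum_{\psi \in \Psi \setminus \Psi_Y} \psi + \sum_{\psi' \in \Psi_Y} \psi' \Big) \Big] \\
&= \mathop{M}_{X \in V \setminus \{Y\}} \Big[ \prod_{\phi \in \Phi \setminus \Phi_Y} \phi \Big( \Big( \mathop{M}_{Y} \prod_{\phi' \in \Phi_Y} \phi' \Big) \sum_{\psi \in \Psi \setminus \Psi_Y} \psi + \mathop{M}_{Y} \prod_{\phi' \in \Phi_Y} \phi' \sum_{\psi' \in \Psi_Y} \psi' \Big) \Big] \\
&= \mathop{M}_{X \in V \setminus \{Y\}} \Big[ \prod_{\phi \in \Phi \setminus \Phi_Y} \phi \; \phi_Y \Big( \sum_{\psi \in \Psi \setminus \Psi_Y} \psi + \frac{\mathop{M}_{Y} \prod_{\phi' \in \Phi_Y} \phi' \sum_{\psi' \in \Psi_Y} \psi'}{\phi_Y} \Big) \Big] \\
&= \mathop{M}_{X \in V \setminus \{Y\}} \Big[ \phi_Y \prod_{\phi \in \Phi \setminus \Phi_Y} \phi \Big( \sum_{\psi \in \Psi \setminus \Psi_Y} \psi + \psi_Y \Big) \Big] \qquad (5)
\end{aligned}$$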



The sets $\Phi^* = (\Phi \setminus \Phi_Y) \cup \{\phi_Y\}$ and 
$\Psi^* = (\Psi \setminus \Psi_Y) \cup \{\psi_Y\}$ are the updated sets of 
probability and utility potentials obtained after the elimination of $Y$. The 
evaluation of $N$ proceeds in a similar manner for the remaining variables. 
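
The elimination step can be checked numerically on a tiny discrete example. The tables below are invented, the variable names are hypothetical, and only chance variables are involved, so the generalized marginalization reduces to summation; the sketch compares brute-force evaluation with the result of first eliminating $Y$ through $\phi_Y$ and $\psi_Y$.

```python
import math

X_STATES = ("x0", "x1")
Y_STATES = ("y0", "y1")

# Invented potentials: P(X), P(Y | X) and a utility psi(X, Y).
p_x = {"x0": 0.4, "x1": 0.6}
p_y_given_x = {("x0", "y0"): 0.9, ("x0", "y1"): 0.1,
               ("x1", "y0"): 0.2, ("x1", "y1"): 0.8}
psi = {("x0", "y0"): 10.0, ("x0", "y1"): -5.0,
       ("x1", "y0"): 2.0, ("x1", "y1"): 7.0}

# Brute force: sum over all configurations of the product times the utility.
eu_direct = sum(p_x[x] * p_y_given_x[x, y] * psi[x, y]
                for x in X_STATES for y in Y_STATES)

# Eliminate Y: phi_Y marginalizes the probability potentials containing Y,
# psi_Y marginalizes probability times utility and divides by phi_Y (0/0 = 0).
phi_y = {x: sum(p_y_given_x[x, y] for y in Y_STATES) for x in X_STATES}
psi_y = {}
for x in X_STATES:
    numerator = sum(p_y_given_x[x, y] * psi[x, y] for y in Y_STATES)
    psi_y[x] = numerator / phi_y[x] if phi_y[x] != 0 else 0.0

# Continue with the updated sets: {P(X), phi_Y} and {psi_Y}.
eu_eliminated = sum(p_x[x] * phi_y[x] * psi_y[x] for x in X_STATES)
```

Both routes give the same expected utility, while the second one only ever touches the potentials that mention $Y$.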

If Y is a continuous variable, then $\sum_{\psi \in \Psi_Y} \psi$ is either a constant, a first order, 
or second order polynomial in Y. If Y is a discrete variable, then ^ 

constant in Y. If Y is a decision variable, then $\prod_{\phi \in \Phi_Y} \phi$ considered as a function 
of Y alone is a non-negative constant. This implies that the elimination of Y is 
simple. This observation is due to Lemma 1 of [2]. 

Consider the division of probability potentials in equation 4. If 
$Y \in \Gamma_C$, then either $Y$ is probabilistic barren and no division is 
necessary or the division corresponds to the complement operation described in 
section 2.1. 

During the evaluation of N, we eliminate variables from the combination 
of probability and cg-potentials, and from the combination of probability, cg, 
and utility potentials. Notice, however, that if the variable Y to be eliminated 
next is a continuous variable, then no probability potential is involved in the 
elimination. Similarly, if Y is a discrete variable, then all (relevant) continuous 
variables have been eliminated and no cg-potential is involved. This is due to 
the model structure constraints. 

The constraint $\sigma(\Delta) > \sigma(\Gamma)$ on $\sigma$ can be relaxed. The 
relaxation can be explained in terms of the topology of $G = (V, E)$. Let 
$\mathcal{P}(D_i)$ and $\mathcal{F}(D_i)$ denote the past and future of $D_i$, 
respectively. That is, 
$\mathcal{P}(D_i) = \bigcup_{j=0}^{i-1} I_j \cup \{D_1, \ldots, D_{i-1}\}$ and 
$\mathcal{F}(D_i) = \bigcup_{j=i}^{n} I_j \cup \{D_{i+1}, \ldots, D_n\}$. The 
requisite past $Rq(D_i)$ and relevant future $Rl(D_i)$ of $D_i$ are defined as 
[13, 10]: 

$$Rq(D_i) = \{X \in \mathcal{P}(D_i) : X \not\perp de(D_i) \cap \Upsilon \mid pa(D_i) \setminus \{X\}\},$$

$$Rl(D_i) = \{X \in \mathcal{F}(D_i) : X \not\perp de(D_i) \cap \Upsilon \mid pa(D_i)\},$$

where $de(D_i)$ are the descendants of $D_i$ in the graph of the minimalization of 
$N$ ignoring information arcs. 

A discrete variable $X$ is not allowed to be relevant ($Rl$) for a continuous 
decision variable $D$, but $X$ is allowed to be requisite ($Rq$) for 
$D \in \Gamma$. That is, the condition $Rl(D) \cap \Delta = \emptyset$ must be 
satisfied whereas there is no constraint on $Rq(D)$. 

The above derivation is based on the assumption that variables can be elim- 
inated using efficient local operations. Such operations are derived next. 

3.1 The Marginalization Operator 

As continuous variables are eliminated before discrete variables, the elimination 
of a discrete variable involves either maximization or summation over a discrete 
function. Hence, the set of cg-potentials is empty. This implies that elimination 
of a discrete variable proceeds as in the case of pure discrete influence diagrams, 
see e.g. [7] for details. 

Definition 1. The operation of marginalization of a discrete variable $X$ is 
defined as $M_X = \sum_X$. 

Definition 2. The operation of marginalization of a discrete decision variable 
$X$ from a utility potential $\psi(X, I)$ is defined as $M_X \psi = \max_X \psi$. 
The optimal policy $\delta_X(i)$ for $X$ is: 

$$\delta_X(i) = \arg\max_x \psi(x, i).$$

Consider the elimination of a continuous decision variable $X$ from a utility 
potential $\psi = \sum_{\psi' \in \Psi_X} \psi'$ such that 
$\psi = [Q, R, S](\{Y, I\})$. For each $i$, the $Q(i)$ matrix is partitioned 
relative to the decision variable $X$ as follows (assuming $Y = (X, Z^T)^T$): 

$$Q(i) = \begin{pmatrix} Q_{xx}(i) & Q_{xz}(i) \\ Q_{zx}(i) & Q_{zz}(i) \end{pmatrix},$$

where $Q_{xz} = Q_{zx}^T$. Similarly, for each $i$, the vector $R(i)$ is 
partitioned relative to $X$ into $R_x(i)$ and $R_z(i)$. With this partitioning, we 
get: 

$$\begin{aligned}
\mathop{M}_{X} \psi &= \mathop{M}_{X} \; y^T Q(i) y + R(i) y + S(i) \\
&= z^T Q_{zz}(i) z + R_z(i) z + S(i) + \mathop{M}_{X} \; x^T Q_{xx}(i) x + R_x(i) x + 2 x^T Q_{xz}(i) z.
\end{aligned}$$

Considered as a function of $x$, $\psi(y, i)$ takes on its maximum value (i.e. 
$\arg\max_x \psi(y, i)$) at $x = -\frac{R_x(i) + 2 Q_{xz}(i) z}{2 Q_{xx}(i)}$ for 
each $i$ assuming that $Q_{xx}(i) < 0$ (or $\psi(y, i)$ is constant in $x$, in 
which case $\psi$ is not a function of $X$). The maximum value of $\psi(y, i)$ at 
$x$ (i.e. $\max_x \psi(y, i)$) is: 

$$\max_x \psi(y, i) = -\frac{(R_x(i) + 2 Q_{xz}(i) z)^2}{4 Q_{xx}(i)} + \left( z^T Q_{zz}(i) z + R_z(i) z + S(i) \right).$$

The result of $\max_x \psi(y, i)$ is a utility potential 
$\psi(z, i) = [Q^*, R^*, S^*]$. 



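Definition 3. The operation of marginalization of a continuous decision variable 
$X$ from a utility potential $\psi(y, i) = [Q, R, S]$ is defined as 
$M_X \psi = [Q^*, R^*, S^*](\{Z, I\})$ where, expanding the maximum value above: 

$$Q^*(i) = Q_{zz}(i) - \frac{Q_{zx}(i) Q_{xz}(i)}{Q_{xx}(i)},$$

$$R^*(i) = R_z(i) - \frac{R_x(i) Q_{xz}(i)}{Q_{xx}(i)},$$

$$S^*(i) = S(i) - \frac{R_x(i)^2}{4 Q_{xx}(i)}.$$

The optimal decision policy $\delta_X(z, i)$ for $X$ is: 

$$\delta_X(z, i) = -\frac{R_x(i) + 2 Q_{xz}(i) z}{2 Q_{xx}(i)}.$$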


Notice that the constraint that Q is negative semi-definite can be relaxed. It 
is sufficient that $Q_{xx}(i) < 0$ for all $i$ when eliminating the continuous decision 
variable X since this implies that the second order polynomial has a unique 
maximum with respect to X. 
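
For a scalar decision $X$ and a scalar remaining variable $Z$, the elimination can be sketched as below. The function names are hypothetical and the coefficients follow from expanding the maximum value given above; the second example encodes the payoff "constant minus (height minus guess) squared" of the game in Fig. 2, with an invented constant of 100 and under the purely illustrative assumption that the height $z$ is known when the guess is made.

```python
import math


def utility(x, z, qxx, qxz, qzz, rx, rz, s):
    """Scalar quadratic utility with symmetric Q = [[qxx, qxz], [qxz, qzz]]."""
    return qxx * x * x + 2 * qxz * x * z + qzz * z * z + rx * x + rz * z + s


def eliminate_decision(qxx, qxz, qzz, rx, rz, s):
    """Maximize over the decision x; returns (Q*, R*, S*) and the optimal policy."""
    if qxx >= 0:
        raise ValueError("qxx must be negative for a unique maximum")
    q_star = qzz - qxz * qxz / qxx
    r_star = rz - rx * qxz / qxx
    s_star = s - rx * rx / (4 * qxx)

    def policy(z):
        return -(rx + 2 * qxz * z) / (2 * qxx)

    return (q_star, r_star, s_star), policy


# Invented coefficients for a generic check.
coeffs = dict(qxx=-2.0, qxz=0.5, qzz=-1.0, rx=3.0, rz=1.0, s=4.0)
(q_star, r_star, s_star), policy = eliminate_decision(**coeffs)
z = 1.5
best_x = policy(z)
best_value = utility(best_x, z, **coeffs)
reduced_value = q_star * z * z + r_star * z + s_star

# Guessing game: utility = 100 - (z - x)^2, so the best guess equals the height.
game, game_policy = eliminate_decision(qxx=-1.0, qxz=1.0, qzz=-1.0,
                                       rx=0.0, rz=0.0, s=100.0)
```

The reduced potential evaluated at $z$ agrees with the utility at the optimal decision, and any other choice of $x$ gives a lower value.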

There are two cases to consider when defining the operation of marginaliza- 
tion of a continuous chance variable X. First the simple case where X has to be 
eliminated from a cg-potential. 







Definition 4. Marginalization of a continuous chance variable $X$ from a 
cg-potential $\lambda$ is strong marginalization, 
$M_X \lambda = \int_x \lambda \, dx$. 

In the general case we need to eliminate $X$ from a combination $\phi * \psi$ of a 
cg-potential $\phi = [A, B, C](\{X\} | \{Z, I\})$ and utility potential 
$\psi = [Q, R, S](\{Y, I\})$. We assume proper domain extensions have been made 
such that $dom(\phi) = dom(\psi)$. Let $R$ and $Q$ be partitioned relative to $X$ 
and $Z$ as described above. The properties $E(X) = \mu$ and 
$E(X^2) = \mu^2 + \sigma^2$ of the normal distribution are exploited when 
eliminating $X \sim \mathcal{N}(\mu, \sigma^2)$. The variable $X$ is eliminated 
from $(\phi * \psi)(y, i)$ as follows: 

$$\mathop{M}_{X} \phi * \psi = \mathop{M}_{X} \phi * \left( y^T Q(i) y + R(i) y + S(i) \right) = z^T Q^*(i) z + R^*(i) z + S^*(i).$$

Definition 5. Let $\phi$ and $\psi$ be a cg-potential and a utility potential, respectively.
The operation of marginalization of continuous chance variable X from
$(\phi * \psi)(y,i)$ is defined as $\mathsf{M}_X \phi * \psi = [Q^*, R^*, S^*](\{Z, I\})$ where:

$$Q^*(i) = Q_{zz}(i) + 2Q_{zx}(i)B(i) + Q_{xx}(i)\left( B(i)^T B(i) \right),$$

$$R^*(i) = R_z(i) + 2A(i)Q_{xz}(i) + \left( 2Q_{xx}(i)A(i) + R_x(i) \right) B(i),$$

$$S^*(i) = Q_{xx}(i)\left( A(i)^2 + C(i) \right) + S(i) + R_x(i)A(i).$$
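
The formulas of Definition 5 can be checked against a direct computation of the expected utility. The sketch below is illustrative only and rests on stated assumptions: X is scalar with X | z ~ N(A + B.z, C), z is a plain list, one discrete configuration i is fixed, and the function names are hypothetical.

```python
def eliminate_chance(A, B, C, Qxx, Qxz, Qzz, Rx, Rz, S):
    """Return (Q_star, R_star, S_star) after integrating out X | z ~ N(A + B.z, C)."""
    m = len(B)
    # Q* = Qzz + 2 Qzx B + Qxx B'B (entry-wise; Qzx B is an outer product)
    Q_star = [[Qzz[r][c] + 2.0 * Qxz[r] * B[c] + Qxx * B[r] * B[c]
               for c in range(m)] for r in range(m)]
    # R* = Rz + 2 A Qxz + (2 Qxx A + Rx) B
    R_star = [Rz[r] + 2.0 * A * Qxz[r] + (2.0 * Qxx * A + Rx) * B[r]
              for r in range(m)]
    # S* = Qxx (A^2 + C) + S + Rx A
    S_star = Qxx * (A ** 2 + C) + S + Rx * A
    return Q_star, R_star, S_star


def quad(Q, R, S, z):
    """Evaluate z'Qz + R.z + S for list-based Q and R."""
    m = len(z)
    return (sum(z[r] * Q[r][c] * z[c] for r in range(m) for c in range(m))
            + sum(R[r] * z[r] for r in range(m)) + S)


def expected_utility_direct(A, B, C, Qxx, Qxz, Qzz, Rx, Rz, S, z):
    """E[utility | z] using E(X) = mu and E(X^2) = sigma^2 + mu^2 directly."""
    mu = A + sum(b * v for b, v in zip(B, z))
    ex2 = C + mu ** 2
    cross = sum(q * v for q, v in zip(Qxz, z))
    return Qxx * ex2 + Rx * mu + 2.0 * mu * cross + quad(Qzz, Rz, S, z)
```

For any z, `quad(*eliminate_chance(...), z)` and `expected_utility_direct(..., z)` should coincide up to rounding.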

4 Lazy Propagation 

Lazy propagation in an extended version can be used to solve N efficiently by 
message passing in a strong junction tree representation T of N. In this section 
we present the extensions of Lazy propagation necessary to solve N . The main 
idea of Lazy propagation is to maintain decompositions of clique potentials until 
combination becomes mandatory by a variable elimination. Therefore, the notion 
of a potential is introduced: 

Definition 6 (Potential). A potential on $W \subseteq V$ is a pair $\pi_W = (\Phi, \Psi)$ where
$\Phi$ is a set of non-negative real functions on subsets of W and $\Psi$ is a set of real
functions on subsets of W.

The probability part $\Phi = \{p_i\} \cup \{\lambda_j\}$ of a potential is a set of probability
potentials $\{p_i\}$ and cg-potentials $\{\lambda_j\}$ whereas the utility part $\Psi$ is a set
of local utility functions as defined above. We call a potential $\pi_W$ vacuous, if
$\pi_W = (\emptyset, \emptyset)$. We define new operations of combination and marginalization.

Definition 7 (Combination). The combination of potentials $\pi_{W_1} = (\Phi_1, \Psi_1)$
and $\pi_{W_2} = (\Phi_2, \Psi_2)$ denotes the potential on $W_1 \cup W_2$ given by
$\pi_{W_1} \otimes \pi_{W_2} = (\Phi_1 \cup \Phi_2, \Psi_1 \cup \Psi_2)$.
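
Because combination is only a union of sets, no arithmetic is performed until a variable has to be eliminated. A minimal sketch of this decomposed representation is given below; it is an illustration rather than the authors' data structure, it represents each function by a string label, and the class name `Potential` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Potential:
    """A potential as a pair (probability part, utility part) of function labels."""
    prob: frozenset = frozenset()   # probability potentials and cg-potentials
    util: frozenset = frozenset()   # local utility functions

    def is_vacuous(self):
        # Vacuous potential: both parts are empty sets
        return not self.prob and not self.util

    def combine(self, other):
        # Combination is lazy: it only takes unions and multiplies nothing
        return Potential(self.prob | other.prob, self.util | other.util)
```

Combining with a vacuous potential leaves a potential unchanged, which is why cliques can safely be initialized with vacuous potentials.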

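Definition 8 (Marginalization). The marginalization of $\pi_W = (\Phi, \Psi)$ onto
$W \setminus W_1$ is defined as:
where:

$$\phi^{W_1} = \mathsf{M}_{W_1} \prod_{\phi \in \Phi_{W_1}} \phi,$$

$$\psi^{W_1} = \mathsf{M}_{W_1} \prod_{\phi \in \Phi_{W_1}} \phi \sum_{\psi \in \Psi_{W_1}} \psi,$$

$\Phi_{W_1} = \{\phi \in \Phi : W_1 \cap \mathrm{dom}(\phi) \neq \emptyset\}$, and $\Psi_{W_1} = \{\psi \in \Psi : W_1 \cap \mathrm{dom}(\psi) \neq \emptyset\}$.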

We use the convention that 0/0 = 0. 



4.1 Initialization 

The first step in initialization of $T = (\mathcal{C}, \mathcal{S})$ is to associate a vacuous potential
with each clique $C \in \mathcal{C}$. Then, for each chance variable X, $\phi(X | \mathrm{pa}(X))$ is
associated with the probability part of $\pi_C$ for any clique C satisfying $C \supseteq \mathrm{fa}(X)$.
Similarly, for each utility node U, $\psi(\mathrm{pa}(U))$ is associated with the utility part
of $\pi_C$ for any clique C satisfying $C \supseteq \mathrm{pa}(U)$.

After initialization each clique C holds a potential $\pi_C = (\Phi, \Psi)$ and the joint
potential on T is:

$$\pi_V = (\Phi_V, \Psi_V) = \bigotimes_{C \in \mathcal{C}} \pi_C.$$

Notice that $\bigcup_{\phi \in \Phi_C} \mathrm{dom}(\phi) \cup \bigcup_{\psi \in \Psi_C} \mathrm{dom}(\psi) \subseteq C$ for $\pi_C = (\Phi_C, \Psi_C)$.

4.2 Message Passing 

A mixed influence diagram N is solved by message passing in T via the separators 
of T. The separator between two neighboring cliques A and B is $A \cap B$. Messages
are passed from the leaf cliques of T to the strong root R by recursively letting
each clique A pass a message to its parent B whenever A has received a message
from each of its children. The message $\pi_{A \to B}$ is passed from clique A to clique B
by absorption. Absorption from A to B involves eliminating the variables $A \setminus B$
from the combination of the potential associated with A and the messages passed
to A from its neighbors ne(A) except B. The message $\pi_{A \to B}$ is:

$$\pi_{A \to B} = \left( \pi_A \otimes \left( \bigotimes_{C \in \mathrm{ne}(A) \setminus \{B\}} \pi_{C \to A} \right) \right)^{\downarrow B},$$

where $\pi_{C \to A}$ is the message passed from C to A.

Theorem 1. Suppose we start with a joint potential $\pi_V$ on a strong junction tree
T, and pass messages toward the root clique R as described above. When R has
received a message from each of its neighbors, the combination of all messages
with its own potential is equal to the R-marginal of $\pi_V$:

$$\pi_V^{\downarrow R} = \left( \bigotimes_{C \in \mathcal{C}} \pi_C \right)^{\downarrow R} = \pi_R \otimes \left( \bigotimes_{C \in \mathrm{ne}(R)} \pi_{C \to R} \right),$$

where $\mathcal{C}$ is the set of cliques in T.







4.3 Local Optimization and Local Computation 

The maximizing alternatives of the utility potential from which decision vari- 
able D is eliminated during evaluation of N are recorded as the optimal decision 
rule for D, which is either a constant or a linear function of the parents.

A message passed from clique A to clique B is, as explained above, computed
by eliminating all variables of $A \setminus B$. The structure of the strong junction tree
T imposes a partial order on the variable elimination order. The structure of T
does not impose any constraints on the order in which the variables of $A \setminus B$
have to be eliminated, but the constraint $\sigma(\Delta) > \sigma(\Gamma)$ has to be satisfied. Thus,
the variables of $A \setminus B$ can be eliminated in any legal order when computing the
message to pass from A to B.

In [8, 7] it is described how independence relations and probabilistic barren 
variables can be exploited to decrease the computational cost of message passing 
in Bayesian networks and discrete influence diagrams, respectively. Independence 
relations and barren variables can be exploited during the solution of mixed 
influence diagrams in a similar way. 

4.4 Example 

Consider the mixed influence diagram N of Fig. 3 and its corresponding strong 
junction tree T shown in Fig. 4. The initialization of T proceeds as explained in 
section 4.1. The precedence order on the chance variables is:
$\{I_2\} \prec D_1 \prec \{X_2\} \prec D_2 \prec \{I_1, X_1, X_3\}$, which does not satisfy $\Delta \prec \Gamma$.
Notice, however, that $R_p(D_1) = \{I_2\}$, $R_p(D_2) = \{I_2, D_1, X_2\}$,
$R_f(D_1) = \{I_1, X_1, X_2, D_2, X_3\}$, and $R_f(D_2) = \{X_3\}$.
Thus, even though $I_1 \in \mathcal{I}_2$ it is possible to solve N since $I_1 \notin R_f(D_2)$.




Fig. 3: An extension of the mixed influence diagram N shown in Fig. 1 



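In order to solve N messages are passed in T from the leaf clique $C_3 = D_1X_1X_3$
to the strong root $C_1 = I_2D_1I_1$ as follows (where $C_2 = I_2D_1X_2D_2X_1$
and $S_{23} = C_2 \cap C_3 = \{D_1, X_1\}$):

$$\pi_{S_{23}} = \pi_{C_3}^{\downarrow S_{23}} = (\emptyset, \{\psi(X_1, D_2)\}) = \left( \emptyset, \left\{ \mathsf{M}_{X_3} \psi_2(X_3) \phi(X_3 | X_1, D_2) \right\} \right).$$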

Notice that $X_3$ is probabilistic barren, implying that the probability part of
$\pi_{S_{23}}$ is the empty set. The second message is computed as:

$$\pi_{S_{12}} = (\pi_{C_2} \otimes \pi_{S_{23}})^{\downarrow S_{12}} = (\emptyset, \{\psi(I_2, D_1)\}).$$

Fig. 4: Strong junction tree representation of N in Fig. 3



To obtain the maximum expected utility and the optimal strategy, we compute
$(\pi_V)^{\downarrow \emptyset} = (\pi_{C_1} \otimes \pi_{S_{12}})^{\downarrow \emptyset}$. The structure of T induces the elimination order.

Due to the simplicity of the example there is no freedom w.r.t. selecting the 
on-line elimination order. 

5 Discussion and Conclusion 

The work by [3, 14] on linear-quadratic-Gaussian influence diagrams imposes a 
number of restrictions on the structure of the influence diagram. All (chance) 
variables are Gaussian distributed, the interaction between chance variables is 
linear, and the sources of uncertainty are Gaussian distributed and uncorrelated. 
These restrictions are similar to the restrictions our architecture puts on the 
continuous variables. 

In the architecture of [14], the utility function is assumed to decompose 
additively into a set of local utility functions. Prior to evaluating the influence 
diagram, the local utility functions are combined into a single utility function 
with a domain equal to the union of the set of chance and the set of decision 
variables. This more or less corresponds to assuming that all variables in the 
model condition the utility function. This assumption is made due to difficulty 
with maintaining minimal conditional predecessors and not letting the utility 
function be a part of the influence diagram. This implies, for instance, that the 
architecture is not able to exploit probabilistic barren variables. Barren variables, 
on the other hand, can be removed from the diagram as a preprocessing step. The 
solution method is based on arc-reversal. During the evaluation, the expected 
value of the value function is maintained as the influence diagram is transformed 
via reduction operations. In this way, the architecture maintains a valid influence 
diagram representation during the evaluation of the decision problem where the 
utility function is maintained outside the influence diagram. 

The work of [11] on mixed influence diagrams is based on an (artificial) mixture
distribution approach. The distribution of a continuous variable is approximated
using a mixture of Gaussians. The structure of the influence diagram
models is constrained under the same conditions as the architecture we present.

Both of the above architectures are based on the central-moment representa- 
tion whereas the architecture we propose is based on the raw-moment represen- 
tation. This gives a few differences with respect to specification of the decision 
problem. For instance, it is necessary to make additional passes over the struc- 
ture using the central-moment representation. 

We have presented the first junction tree based architecture for solving in- 
fluence diagrams with a mixture of continuous and discrete variables, i.e. linear- 
quadratic conditional Gaussian influence diagrams. The architecture, which is
based on exact computation, is also the first architecture to efficiently exploit
(additively or multiplicatively) decomposing utility functions. This makes the 
architecture more efficient than the architecture of [14], for instance. 

Returning to the example of the simple game with two decisions, the optimal 
strategy is, of course, to stay in the game and make a guess on the height, which
is the average height of a person with the given sex. 

Recently, there has been some development in representation and solution of 
Bayesian network models containing both discrete and continuous variables, see 
e.g. [6]. This includes discrete children of continuous variables and arbitrarily dis- 
tributed continuous variables. Extending these methods to the case of influence 
diagrams is an interesting topic of future research. 



References 

1. R. A. Howard and J. E. Matheson. Influence Diagrams. In The Principles and 
Applications of Decision Analysis, volume 2, chapter 37, pages 721-762. 1981. 

2. F. Jensen, F. V. Jensen, and S. Dittmer. From Influence Diagrams to Junction 
Trees. In Proc. of the 10th UAI, pages 367-373, 1994. 

3. C. R. Kenley. Influence Diagram Models with Continuous Variables. PhD thesis, 
EES Department, Stanford University, 1986. 

4. S. L. Lauritzen and F. Jensen. Stable Local Computation with Mixed Gaussian 
Distributions. Statistics and Computing, 11(2):191-203, 2001.

5. S. L. Lauritzen and D. Nilsson. Representing and solving decision problems with 
limited information. Management Science, 47:1238-1251, 2001. 

6. U. Lerner, E. Segal, and D. Koller. Exact Inference in Networks with Discrete
Children of Continuous Parents. In Proc. of the 17th UAI, pages 319-328, 2001. 

7. A. L. Madsen and F. V. Jensen. Lazy Evaluation of Symmetric Bayesian Decision 
Problems. In Proc. of the 15th UAI, pages 382-390, 1999. 

8. A. L. Madsen and F. V. Jensen. Lazy propagation: A junction tree inference 
algorithm based on lazy evaluation. Artificial Intelligence, 113(1-2):203-245, 1999.

9. A. L. Madsen and D. Nilsson. Solving Influence Diagrams using HUGIN, Shafer- 
Shenoy and Lazy Propagation. In Proc. of the 17th UAI, pages 337-345, 2001. 

10. T. D. Nielsen. Decomposition of Influence Diagrams. In Proc. of the 6th EC- 
SQARU, pages 144-155, 2001. 

11. W. B. Poland. Decision Analysis with Continuous and Discrete Variables: A Mix- 
ture Distribution Approach. PhD thesis, Engineering-Economic Systems, Stanford 
University, Stanford, CA, 1994. 

12. R. Shachter. Evaluating influence diagrams. Operations Research, 34(6):871-882, 
1986. 

13. R. Shachter. Bayes-Ball: The Rational Pastime (for Determining Irrelevance and
Requisite Information in Belief Networks and Influence Diagrams). In Proc. of the
14th UAI, pages 480-487, 1998.

14. R. D. Shachter and C. R. Kenley. Gaussian influence diagrams. Management 
Science, 35(5):527-549, 1989. 

15. P. P. Shenoy. Valuation-Based Systems for Bayesian Decision Analysis. Operations 
Research, 40(3):463-484, 1992. 




Decision Making Based on Sampled Disease 
Occurrence in Animal Herds 



Michael Höhle^1,2 and Erik Jørgensen^2

^1 Department of Animal Science and Animal Health, Royal Veterinary and
Agricultural University, Grønnegårdsvej 3, 1870 Frb. C, Denmark
hoehle@dina.dk

^2 Department of Animal Breeding and Genetics, Danish Institute of Agricultural
Sciences, Research Centre Foulum, PO Box 50, 8830 Tjele, Denmark
Erik.Jorgensen@agrsci.dk



Abstract. Making qualified decisions when extrapolating results from
a survey sample with imprecise tests requires careful handling of uncer- 
tainty. Both the imprecise test and uncertainty introduced by the sam- 
pling have to be taken into account in order to act optimally. This paper 
formulates an influence diagram with discrete and continuous nodes to 
handle an example typical for animal production: a veterinarian who - 
as part of a biosecurity program - has to decide whether to treat a herd 
of animals after inspecting a small fraction of them. 

Our aim is to investigate the robustness of the obtained strategy by performing
a two-way sensitivity analysis with respect to the proportion of
false positives and false negatives of the test. The output of the analysis is
a treatment map illustrating how the chosen strategy varies with
these proportions. The map helps to investigate whether
a certain variation is acceptable or whether the test procedure has to be
standardized in order to reduce variation. The objective of the paper is to be an
appetizer for further work on the issues raised in obtaining a practical
solution.



1 Introduction 

Traditional survey sampling as e.g. in [1] is concerned with establishing the 
proportion of individuals having a specific characteristic in a population. This 
is done by extrapolating results from a sample to the entire population. In the 
traditional case, investigation of each individual in the sample will reveal its true 
state, i.e. as either having the property or not. In many practical applications 
such precise answers are not available - the test is imprecise thus introducing 
both false negatives and false positives. An example from the veterinarian field 
is the use of a diagnostic test to determine the disease prevalence of a herd. 
The task of establishing the disease status of a herd is typical for biosecurity 
programs, e.g. for salmonella in pigs or Johne’s disease in cattle [2,3]. Similar 
examples are found in clinical decision making or when testing for GM-seeds in 
seed lots [4,5]. Estimates on disease prevalence, $0 \le p \le 1$, need to take the
sensitivity and specificity of the diagnostic test into account, i.e. respectively
the fractions of diseased and non-diseased cases correctly diagnosed by the test.
In practical situations these fractions can be hard and resource demanding to
establish for a test method. Even worse, they are also open to a great deal
of variation, for example when different veterinarians have to determine herd
prevalence of e.g. pneumonia or diarrhea in a section of slaughter pigs [4].

If the same test is performed in all cases, an investigation could be performed
to establish the sensitivity and specificity (Se, Sp) of the test procedure. With the
uncertainty in p due to sampling taken into account, a biosecurity program could
recommend the following treatment strategy: Treat all animals in the section at
cost $C_D$ if p is above some threshold T and do nothing if p < T. Based on the true
prevalence and treatment chosen at the current time stage a reward is given. The
aim is to choose the threshold maximizing the expected reward. Even though we
find the optimal T, and recommend the strategy to all veterinarians, we would
not take into account the variability in (Se, Sp) due to each veterinarian making
his own subjective clinical diagnosis for every investigated individual. Assume
the specific veterinarian has a true (but unknown) setup of $(Se + \delta, Sp + \epsilon)$. If
he follows the threshold based on (Se, Sp) he might not achieve the maximum
expected utility because his uncertainty in p is of a different magnitude and
shape.

Current biosecurity programs, e.g. the voluntary herd status program against 
Johne’s disease [2], operate with fixed point estimates on sensitivity and speci- 
ficity of the diagnostic test. To assess the impact of the above variability a proper 
sensitivity analysis should therefore be an integral part of the modeling. Methods 
such as one-way and two-way sensitivity analysis, tornado, rainbow diagrams, 
etc., provide valuable insights about implementational robustness of an optimal 
strategy [6]. Another approach would be to quantify uncertainty on sensitiv- 
ity and specificity by distributions [7,8]. Our interest, however, is the decision 
analytic dimension of the problem: How does variability affect a biosecurity program
that assumes fixed point estimates on sensitivity and specificity? How large
deviations are allowed before the recommended strategy is suboptimal?

The following sections will show how the above considerations boil down to 
performing a two-way sensitivity analysis for an influence diagram with both 
discrete and continuous nodes. How to perform sensitivity analysis analytically
in Bayesian networks is already well established [9,10] whereas the matter
is more complicated in influence diagrams. Here, especially sequential decision 
problems quickly become intractable to handle [11,10]. As the above treatment 
considerations only contain a single decision, analytical calculations are tractable 
up to certain herd sizes, e.g. using Maple [12]. Solution of the diagram can also 
be done numerically using Gibbs sampling, where sensitivity analysis becomes a 
matter of performing many point-wise evaluations. For small herds both analytic 
and numeric solutions can be applied to verify correctness, while the numeric 
approach is the only tractable method once herd size becomes large.

The structure of this article is as follows. Section 2 describes how the clinical
treatment example can be formulated as an influence diagram. Hereafter,
Sect. 3 describes how to calculate expected utilities in this model in order to
select the best decision alternative. Robustness of these decisions to variation of 
the diagnostic test sensitivity and specificity is illustrated in Sect. 4. Finally, a 
discussion of the obtained results is given. 



2 Influence Diagram Formulation 

This section introduces the notation used to describe the decision problem. Let 
the herd be of size N from which a simple random sample of size n is drawn. The 
aim of the investigation is to determine the proportion of diseased animals, i.e. 
p = d/N with d being the number of sick in the population. We assume that the 
true number of diseased individuals in the sample, D+, is obtained by drawing a 
sample of size n without replacement from the population. In this case $D^+$ follows
the hypergeometric distribution with parameters N, d, n. If sampling is with
replacement or an infinite population can be assumed, $D^+$ is a sample from the
binomial distribution with herd prevalence $p \in [0, 1]$. Also, if n is small compared
to both d and N — d the binomial distribution is a good approximation to the 
hypergeometric distribution. Such approximations are necessary because com- 
putations with the hypergeometric distribution quickly become intractable [1]. 
In the following, only binomial sampling is considered. The number of test posi- 
tives, T+, is then given as a sum of two binomial distributions with fixed values 
of the sensitivity, Se, and specificity, Sp, of the diagnostic test as parameters. 
Note that our interest is in the fixed value Se and Sp situation; otherwise a 
natural way to quantify uncertainty on the two variables would be by e.g. a beta 
distribution as in [7,8]. Figure 1 illustrates the above as a graphical model using 
notation from [13]. By specifying a graphical model we obtain a clear overview 
of the dependence structure of the variables. Furthermore, the decision part is 
easily specified using influence diagram notation, for which software exists
to solve at least a discretized version of the problem.




Fig. 1. Graphical model illustrating how the number of test positives, T+, is obtained 
by sampling with replacement introducing both false positives and false negatives. 
Double lined nodes indicate continuous nodes, however, the Se and Sp distributions 
will be trivial in our application

Fig. 2. Influence diagram describing the treatment strategy based on the number of
animals tested positive 



The above distributional explanations are expressed as

$$D^+ \sim \mathrm{Bin}(n, p),$$

$$T^+ \sim \mathrm{Bin}(D^+, Se) + \mathrm{Bin}(n - D^+, 1 - Sp).$$



Inference for the herd prevalence can be formulated in the Bayesian context as 
follows. Given $\{n, T^+, Se, Sp\}$, what is the posterior distribution on p? A typical
application would be to use this distribution to calculate a posterior mean for 
p together with a credibility interval. This estimate could then be used by the 
veterinarian to determine whether a herd should be classified as disease free [7,8]. 

Classical survey sampling would be concerned with how large to choose n in 
order to get a certain confidence in p. Our focus is, however, on the application 
of the prevalence estimate, namely a decision to apply a treatment reducing 
prevalence. Going back to the herd context, a veterinarian typically has to
decide between two decision alternatives: Either treat all animals in the herd, 
e.g. by adding antibiotics to the water supply, or do nothing. Whether to apply 
treatment is decided by the observed number of test positives. In order to decide 
which treatment to use, it is necessary to model how the disease prevalence will 
develop with time and how treatment influences it. A reward is given based on 
the disease prevalence which reflects the price of animals being sick. Figure 2 
extends the graphical model from Fig. 1 with decision and utility nodes (see [14]) 
making it an influence diagram. 

Here, the $D_t$ node is the treat decision with states treat all (ta) and do nothing
(dn). Furthermore, $p_{t+1}$ is the new prevalence^1, $C_D$ a utility node reflecting the
cost of the treatment, and $U_{t+1}$ a utility node indicating the cost of disease as
a function of the new prevalence. The transition probability between the two
prevalences is given by

$$P(p_{t+1} | p_t, D_t) = \begin{cases} k_3 p_t & \text{if } D_t = ta \\ p_t & \text{otherwise} \end{cases}$$


^1 Basically, the situation could be handled without introducing a $p_{t+1}$ node by simply
integrating the disease development into the utility function. But our choice is
conceptually clearer.

To illustrate the principle, simple proportional reduction in case of treating, i.e.
$0 < k_3 < 1$, and preservation of status quo in case of not treating, is used. This
ignores that an infectious disease would spread within the population if nothing
is done. Modeling such a characteristic could, however, easily be done using e.g.
a logistic model.

Economic preference is modeled with the two utility nodes $C_D$ and $U_{t+1}$.
Typically, costs can be established on a per animal basis, which requires knowledge
of the number of animals, N, in the herd to make calculations realistic.
A possible specification of the two utility functions could then be as follows.

$$C_D(D_t) = -k_1 N \cdot I(D_t = \text{treat all}),$$

$$U_{t+1}(p_{t+1}) = -k_2 (p_{t+1} N),$$

where I is an indicator function.

To solve the decision scenario of Fig. 2 it is necessary to find the decision
alternative for $D_t$ which, given evidence $e = \{n, T^+, Se, Sp\}$, yields the highest
expected utility. Because we are using a continuous representation of p, standard
Bayesian network software for solving the influence diagram of Fig. 2 is not
directly applicable. Instead, both an analytic solution method in Maple [12] and a
simulation-based method using WinBugs [15] are investigated. For small herds the
analytic approach is doable and allows us to verify how good an approximation the
sampling approach is in this situation. An advantage of the analytic implementation
is also that we can use the capabilities of Maple when performing sensitivity
analysis.

3 Derivation of the Expected Utility 

In order to calculate the required expected utility given $e = \{n, T^+, Se, Sp\}$ we
need to calculate the posterior distribution $P(p_{t+1} | e)$, which again requires
calculation of $P(p_t | e)$. As already mentioned, only the binomial case is considered.
To calculate $P(p_t | e)$ we exploit the standard result, see e.g. [16], that

$$P(T^+ = x | \ldots) = \binom{n}{x} \left( p\,Se + (1 - p)(1 - Sp) \right)^x \left( p(1 - Se) + (1 - p)Sp \right)^{n - x}.$$
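
The standard result can be verified numerically by summing over the hidden number of truly diseased animals. The sketch below is only a check under the binomial sampling assumption used in the text; the function names `pmf_two_stage` and `pmf_closed_form` are hypothetical.

```python
from math import comb


def binom_pmf(k, n, q):
    """Binomial probability of k successes in n trials with success probability q."""
    return comb(n, k) * q ** k * (1.0 - q) ** (n - k)


def pmf_two_stage(x, n, p, se, sp):
    """P(T+ = x) by summing over D+ and the split of x into true and false positives."""
    total = 0.0
    for d in range(n + 1):                    # D+ ~ Bin(n, p)
        for tp in range(min(d, x) + 1):       # true positives ~ Bin(D+, Se)
            fp = x - tp                       # false positives ~ Bin(n - D+, 1 - Sp)
            if fp > n - d:
                continue
            total += (binom_pmf(d, n, p) * binom_pmf(tp, d, se)
                      * binom_pmf(fp, n - d, 1.0 - sp))
    return total


def pmf_closed_form(x, n, p, se, sp):
    """P(T+ = x) as a single binomial with apparent prevalence q."""
    q = p * se + (1.0 - p) * (1.0 - sp)
    return binom_pmf(x, n, q)
```

Both functions should return the same value for every x between 0 and n, since each sampled animal independently tests positive with probability p Se + (1 - p)(1 - Sp).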



If expert information exists on the prevalence of the herd, this is easily integrated
using prior distributions. If nothing is known, a uniform prior distribution for p
is sufficient. Bayes' rule is exploited to obtain the posterior distribution

$$P(p | T^+, n, Se, Sp) \propto P(T^+ | p, n, Se, Sp) \, P(p | n, Se, Sp).$$

To ensure that the above distribution is proper it is necessary to find an expression
for the normalization constant $P(T^+ | n, Se, Sp)$. Normally in a Bayesian
analysis proportionality of the posterior is sufficient, but, as $P(T^+ | n, Se, Sp)$
depends on Se and Sp, calculating it becomes a concern in the later sensitivity
analysis.

Fig. 3. Comparison of the posterior density $P(U_{t+1} | \ldots)$ calculated analytically and
numerically; the x-axis is obtained utility while the y-axis is the corresponding density. 
The MCMC density is obtained by kernel smoothing the posterior samples obtained 
from WinBugs. The deviations at the end points are partly due to the kernel smoother 
and partly due to problems of the Gibbs sampler to investigate these areas 



Continuing our calculations we observe that $p_{t+1}$ is just a functional transformation
of $p_t$ when $D_t = ta$, i.e. we can use the standard rule for transformation
of random variables to calculate the posterior $P(p_{t+1} | e, D_t = ta)$. If $D_t = dn$ no
transformation is needed. Regarding $U_{t+1}$ as a random variable its distribution
can be obtained in the same way as for $p_{t+1}$ by exploiting the above rule. Given
an observed number of test positives, $T^+ = x$, the expected utility of the treat
and no-treat alternatives can now be calculated as

$$EU(D_t = ta) = E[U_{t+1}(p_{t+1}, D_t = ta)] + C_D(ta),$$

$$EU(D_t = dn) = E[U_{t+1}(p_{t+1}, D_t = dn)] + C_D(dn).$$



The above has been implemented in Maple yielding functions of Se, Sp. To eval- 
uate the approximation of a simulation based approach the model was also for- 
mulated in WinBugs [15], which uses Gibbs sampling to calculate the expected 
utility. Figure 3 shows the posterior distribution of $U_{t+1}$ obtained from Gibbs
sampling (using 10,000 samples after a burn-in of 1,000) and the analytical
distribution of $EU(D_t = ta)$ in a pseudo-realistic setup of $Se = 0.8$, $Sp = 0.6$, $n = 5$,
$T^+ = 2$, $k_3 = 1/2$, $k_2 = -20$, $k_1 = -1$, $N = 100$. In the figure the analytical
expected utility (obtained by integrating the density between the worst case -1100
and best case -100) is -475.1. The numeric mean (obtained as empirical mean
of the samples) is -481.2. In the case $D_t = dn$ we obtain values of -750.4 and
-762.5, respectively. Hence, in the chosen setup we decide to treat all animals.
Note that the WinBugs approach is much easier to implement and solve than the 
analytic approach and appears to be a good approximation. However, it lacks 
the power of being able to describe the expected utility as function of sensitivity 
and specificity. 
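
A third, deterministic route to the same expected utilities is plain numerical integration. The sketch below is not the Maple or WinBugs implementation described above. It assumes a uniform prior on p, uses a midpoint rule, and, as an assumption about sign conventions, passes the cost constants as positive magnitudes (`k1 = 1`, `k2 = 20`) into utilities of the form $-k_1 N$ and $-k_2 p_{t+1} N$, so that the range from -1100 to -100 quoted above is reproduced. Because the utility is linear in the prevalence, only the posterior mean of p is needed. All function names are hypothetical.

```python
def posterior_mean_prevalence(x, n, se, sp, steps=20000):
    """E[p | T+ = x] under a uniform prior on p, by midpoint-rule integration."""
    num = den = 0.0
    for j in range(steps):
        p = (j + 0.5) / steps
        q = p * se + (1.0 - p) * (1.0 - sp)      # probability that an animal tests positive
        like = q ** x * (1.0 - q) ** (n - x)     # binomial coefficient cancels in the ratio
        num += p * like
        den += like
    return num / den


def expected_utilities(x, n, se, sp, k1, k2, k3, N):
    """Return (EU(ta), EU(dn)) given x test positives out of n sampled animals."""
    mean_p = posterior_mean_prevalence(x, n, se, sp)
    eu_ta = -k2 * N * k3 * mean_p - k1 * N       # prevalence reduced to k3*p, treatment cost k1*N
    eu_dn = -k2 * N * mean_p                     # prevalence unchanged, no treatment cost
    return eu_ta, eu_dn


if __name__ == "__main__":
    # Setup of Fig. 3; costs entered as positive magnitudes (an assumption)
    print(expected_utilities(2, 5, 0.8, 0.6, k1=1.0, k2=20.0, k3=0.5, N=100))
```

Under these assumptions the two values should land close to the analytic figures reported in the text, with treating preferred over doing nothing.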

The desired strategy for $D_t$ is now obtained by investigating the expected
utility for both the ta and dn alternative for all $0 \le T^+ \le n$. Empirical
investigations show that for this strategy there will exist a unique threshold T, s.t.

$$\arg\max_{d^*} EU(D_t = d^* | T^+ = x) = \begin{cases} dn & \text{if } 0 \le x < T \\ ta & \text{if } T \le x \le n \end{cases}$$

That is, with the chosen specification of utilities and transitions, any strategy 
for Dt can be compactly represented by the minimum number of test positives 
necessary before all animals will be treated. In the setup used in Fig. 3 we obtain 
T = 0, i.e. we trivially treat no matter the number of observed test-positives. 

Assuming n, Sp, and Se to be fixed, the expected value of a strategy s for
$D_t$ is given as

$$EU(s) = \sum_{x=0}^{n} P(T^+ = x | n, Se, Sp) \, EU_s(D_t | T^+ = x, n, Se, Sp),$$

where $EU_s(D_t | \ldots)$ denotes the expected utility obtained for $D_t$ when choosing
the decision dictated by s(x).
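
The threshold and the expected value of the resulting strategy can be computed together. The following is a sketch under assumptions, not the paper's implementation: a uniform prior on p, midpoint-rule integration, cost constants given as positive magnitudes, and a threshold defined as the smallest x at which treating is at least as good as doing nothing (with n + 1 meaning "never treat"). The names `marginal_and_mean` and `evaluate_strategy` are hypothetical.

```python
from math import comb


def marginal_and_mean(x, n, se, sp, steps=4000):
    """Return (P(T+ = x | n, Se, Sp), E[p | T+ = x]) under a uniform prior on p."""
    num = den = 0.0
    for j in range(steps):
        p = (j + 0.5) / steps
        q = p * se + (1.0 - p) * (1.0 - sp)
        like = comb(n, x) * q ** x * (1.0 - q) ** (n - x)
        num += p * like
        den += like
    if den == 0.0:
        return 0.0, 0.5          # outcome impossible for this (Se, Sp); mean is irrelevant
    return den / steps, num / den


def evaluate_strategy(n, se, sp, k1, k2, k3, N):
    """Return (threshold T, expected utility of the best decision for every x)."""
    threshold, value = n + 1, 0.0
    for x in range(n + 1):
        prob_x, mean_p = marginal_and_mean(x, n, se, sp)
        eu_ta = -k2 * N * k3 * mean_p - k1 * N
        eu_dn = -k2 * N * mean_p
        if eu_ta >= eu_dn and threshold == n + 1:
            threshold = x        # first number of test positives at which we treat
        value += prob_x * max(eu_ta, eu_dn)
    return threshold, value
```

With the setup of Fig. 3 (`n = 5`, `se = 0.8`, `sp = 0.6`, `k1 = 1`, `k2 = 20`, `k3 = 0.5`, `N = 100`) the sketch should return a threshold of 0, matching the remark that treatment is chosen regardless of the number of test positives.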

4 Sensitivity Analysis 

In realistic situations, the sensitivity and specificity of the test are either un- 
known or subject to a great deal of variation. If we e.g. recommend a fixed 
threshold to all veterinarians investigating diarrhea in pig herds, the large vari- 
ation in the two parameters between veterinarians would be ignored. A way to 
investigate a strategy’s robustness towards variations in sensitivity and speci- 
ficity is to find out how the best decision alternative changes with variation in 
Se and Sp. Here, the analytical representation in Maple is of advantage because
we immediately have the expected utility as a function of Se and Sp. This is not
possible using a simulation approach; instead the influence diagram would have
to be solved for a grid of Se and Sp combinations.

Continuing with the values from the veterinarian example, but changing the 
sample size n to 10 and increasing the price of a treatment to $k_1 = -2$, gives a
more interesting example. Figure 4 shows the line of indifference, i.e. the solution 
of 



$$f(Se, Sp) = EU(D_t = ta | Se, Sp) - EU(D_t = dn | Se, Sp) = 0.$$

To investigate the robustness of the decision using the sensitivity and specificity
configuration $p = (Se', Sp')$ it might be worth investigating how much p
can change before a different decision is made. This is equivalent to finding the
distance to the intersection line, i.e.

$$r_c = \mathrm{dist}(p, l), \quad \text{where } l = \{(Se, Sp) \mid f(Se, Sp) = 0\},$$

also known as the radius of change or radius of the safe-ball, see [11,10]. The
higher this radius the more robust the specific policy is against variations. Also,

Fig. 4. Indifference between the two decision alternatives occurs on the line f(Se, Sp) =
0. The figure shows the intersection of f(Se, Sp) with z = 0, i.e. the z-axis is the
difference in utility between the two strategies. To the left of the intersection line, dn
is selected, to the right ta




Fig. 5. Threshold T as a function of (Se, Sp) - a so-called treatment map. Calculated
by evaluating the analytical expression for a grid layout of (Se, Sp) configurations



the difference in expected utility between the two alternatives evaluated at specific
points tells us about the benefit of getting the (Se, Sp) correctly estimated.

To get a better overview of the variation in the strategy we can illustrate the 
obtained T-values as a function of Se and Sp - a two-dimensional analogue of 
a rainbow diagram. Figure 5 shows this treatment map in case 30 of the herd’s
100 individuals are investigated.

For a fixed sensitivity above 0.6, T is higher for specificities near 0.5 than
those near 1. This might be surprising because a heuristic like “the higher the 
test quality the higher the number of positive tests before we react” feels natural. 
But such a heuristic neglects that a good test also results in fewer test-positives, 
because fewer are erroneously classified as positives. Looking at the figure also 
reveals that the radius of change for T will be quite low due to the high variation 
of T values over the parameter space. Again, this underlines the fact that care
should be taken when sensitivity and specificity vary.
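
A treatment map of this kind can be approximated by point-wise evaluation on a grid of (Se, Sp) values. The sketch below is illustrative and does not reproduce Fig. 5: it assumes a uniform prior on p, midpoint-rule integration, cost constants given as positive magnitudes, and a coarse grid; the function names are hypothetical.

```python
def posterior_mean(x, n, se, sp, steps=400):
    """E[p | T+ = x] under a uniform prior on p (midpoint rule)."""
    num = den = 0.0
    for j in range(steps):
        p = (j + 0.5) / steps
        q = p * se + (1.0 - p) * (1.0 - sp)
        like = q ** x * (1.0 - q) ** (n - x)
        num += p * like
        den += like
    return num / den if den > 0.0 else 0.5


def threshold(n, se, sp, k1, k2, k3, N):
    """Smallest number of test positives at which treating all is preferred."""
    for x in range(n + 1):
        mean_p = posterior_mean(x, n, se, sp)
        eu_ta = -k2 * N * k3 * mean_p - k1 * N
        eu_dn = -k2 * N * mean_p
        if eu_ta >= eu_dn:
            return x
    return n + 1                 # never treat


def treatment_map(n, grid, k1, k2, k3, N):
    """Map every (Se, Sp) pair on the grid to its threshold T."""
    return {(se, sp): threshold(n, se, sp, k1, k2, k3, N)
            for se in grid for sp in grid}


if __name__ == "__main__":
    grid = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    tmap = treatment_map(10, grid, k1=2.0, k2=20.0, k3=0.5, N=100)
    for se in grid:
        print(se, [tmap[(se, sp)] for sp in grid])
```

Each grid cell costs one posterior integration per candidate x, which is the point-wise evaluation a simulation approach would also require.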

5 Discussion 

Generation of treatment maps illustrating the sensitivity for varying probabilities 
is a strong tool for gaining insight into the decision scenario. Calculations in 
which the expected utility function is given as an analytical function of Se and 
Sp work up to sample sizes of 30-40. Beyond that, Maple is no longer capable of 
dealing with the generated polynomials. By fixing (Se, Sp) and calculating the 
values on a grid, a much higher n can be achieved, either in Maple or by using 
Gibbs sampling in WinBugs. 

Estimation of the constants $k_1$, $k_2$, and $k_3$ for a specific decision problem is 
problematic; guesstimates, small scale experiments, and sensitivity analysis could 
be employed. Once a reasonable single time-slice model is established, extension 
to the more realistic case with additional time-slices is desirable. Biosecurity 
programs are often a temporal matter, where diagnosis and treatment are made 
repetitively. Limited memory strategies as in [17] might be necessary to obtain 
a tractable solution of the influence diagram. Despite such approximation, our 
approach to sensitivity analysis would not scale up very well with respect to 
additional decisions; even Gibbs sampling would only be feasible for a small 
number of decisions. 

To establish how large a sample size n to choose in order to make an optimal 
decision about treatment would require conversion of n in Fig. 2 into a 
decision node together with a cost of performing the diagnostic test. An 
analytical computation quickly becomes intractable here because n is part of the 
exponent of $P(T^{+} \mid \ldots)$. Solving the influence diagram with the two sequential 
decisions would have to be done by numerical methods such as forward Monte 
Carlo sampling or Markov chain Monte Carlo sampling as described in [18,19]. 

All of the above-mentioned problems would arise if one tried to evaluate and 
revise, e.g., the current Danish Salmonella treatment strategy [3], which 
currently takes neither uncertainty from imprecise tests nor any variability 
in sensitivity and specificity into account. This paper is merely an appetizer for 
working more intensively with the issues raised in order to reach a practical solution. 

Abstract. Branching Constraint Satisfaction Problems (BCSPs) have 
been introduced to model dynamic resource allocation subject to con- 
straints and uncertainty. We give BCSPs a formal probability seman- 
tics by showing how they can be mapped to a certain class of Bayesian 
decision networks. This allows us to describe logical and probabilistic 
constraints in a uniform fashion. We also discuss extensions to BCSPs 
and decision networks suggested by the relationship between the two 
formalisms. 



1 Introduction 

Resource allocation is the problem of assigning resources to tasks subject to con- 
straints, and has been studied in operations research and computer science for 
many years [1,2,9]. Recently, the problem has been investigated using constraint 
satisfaction methods [17], which allow arbitrary combinatorial constraints to be 
placed on the problem. In its simplest form, tasks can be represented by vari- 
ables, and resources by values to be assigned to the variables, while constraints 
restrict the values that can be assigned simultaneously. A solution to a problem 
is then an assignment of values such that all constraints are satisfied. Initially, 
such approaches were restricted to deterministic, static problems; more recently 
they have been extended to problems that change over time, and for which there is 
some uncertainty about what the changes will be. Branching Constraint 
Satisfaction [6,7] has been proposed to model problems where new variables (or tasks) 
are added to the problem after some decisions have been made. The uncertainty 
in the sequence of additions is modelled by a transition tree with arcs labelled 
with probabilities. Branching CSPs are known to be NP-hard [8]. Complete 
and incomplete optimising algorithms have been developed, using a combination 
of constraint-based tree search and decision-theoretic computation, and the 
methods have been compared to those used in Markov Decision Problems [8]. 
However, the probability semantics of BCSPs were presented only informally. 

Bayesian networks have been introduced as formalisms to represent and rea- 
son with joint probability distributions, taking into account conditional indepen- 
dence statements [15]. Given a Bayesian network and a (possibly empty) set of 
evidence concerning the variables included in the network, the probability distri- 
bution on any subset of variables can be computed. For this, efficient algorithms 
exist with fast, well-engineered computer implementations [10,13], even though 
the problem of probabilistic reasoning in Bayesian networks is known to be 
NP-hard in general [4]. However, reasoning with Bayesian networks for real-world 
problems is normally feasible. Formally, Bayesian networks can only be used for 
probabilistic reasoning, but recent work [14,12] shows how some logical consis- 
tencies can be modelled and solved. Finally, we can augment a network with 
decision theory, to obtain influence diagrams or decision networks, which can be 
used for decision-making under uncertainty [10,16]. 

In this paper, we study the relationship between branching CSPs, Bayesian 
networks and decision networks. Our aim is to establish the probability semantics 
by mapping BCSPs to decision networks, providing a uniform representation for 
probabilistic and logical constraints. We introduce Branching CSPs, giving a 
precise, formal definition, and we summarise Bayesian networks and decision 
networks. We then show how BCSPs can be mapped to decision networks, and 
in particular we show how to represent combinatorial constraints. We prove that 
optimal solutions to problems in the two different formalisms are equivalent. 
Finally we consider how the techniques of decision networks may be used to 
generalise BCSPs, and, similarly, how BCSP methods might allow us to make 
explicit use of constraints in decision networks. 



2 Branching Constraint Satisfaction Problems 

2.1 Preliminary Definitions 

In the following, we borrow the terminology for graphs from [18]; if $S = (V, A)$ 
is a directed tree with set of vertices $V$ and set of directed arcs $A \subseteq V \times V$, then 
the set of children of a vertex $v \in V$ is denoted by $\sigma(v)$; the unique parent of 
a vertex $v \in V$ is represented by $\pi(v)$. Furthermore, the level of a vertex $v$ is 
defined as the length of the path from the root to $v$ [11]. The set of all vertices 
in the tree at the same level $n \in \mathbb{N}$ is denoted by $\Lambda(n)$. The terminology will be 
generalised for acyclic directed graphs. Sets of elements will be represented by 
bold face letters, e.g. $\mathbf{V}$, if confusion may arise otherwise. 



2.2 A Motivating Example 

We first present a simple motivating example. A company has three workers, x, 
y and z, and five possible tasks, A, B, C, D and E that it may be asked to carry 
out. Each worker is qualified to do some of the tasks, as shown in Fig. 1; each task 

Staff       Tasks
            A    B    C    D    E
x           ✓    ✓    ✓    ✓    ✓
y           ✓    -    -    -    ✓
z           ✓    -    ✓    -    -
Utilities   3    6    10   6    6

v0/A
  0.6 -> v1/B
           0.5 -> v3/D
           0.5 -> v4/E
  0.4 -> v2/C
           0.5 -> v5/D
           0.5 -> v6/B

Fig. 1. An example BCSP. Left: table of staff skills and tasks, with associated utilities 
for individual tasks; '✓' means that the task is suitable for the worker, and '-' that it is 
unsuitable. Right: probabilistic state transition tree. An entry v/X indicates that 
variable X arrives in vertex v; numeric labels on the arcs indicate transition probabilities. 



is associated with a utility, representing the profit resulting from completing the 
task successfully. No worker can do more than one task. The company has some 
uncertain knowledge about the sequence of tasks it will be asked to perform, 
sketched as a probabilistic state transition tree in Fig. 1. There will definitely 
be three tasks, and the first task to arrive will be A. Subsequently, either task 
B or C will arrive, with probabilities 0.6 and 0.4 respectively. If the second task 
is B, then the last task will be either D (with probability 0.5) or E (probability 
0.5). If the second task is C, then the last task will either be D (0.5) or B (0.5). 
Some sequences of tasks may not be feasible for the company to do, and so it 
may choose to reject some tasks. The aim is to assign workers to tasks as soon as 
the tasks arrive, maximising the expected utility, while ensuring all constraints 
are satisfied. 
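
The arrival process described above can be written down directly as a small tree. The sketch below is illustrative only: the vertex names v0 to v6 follow Fig. 1, while the dictionary layout and the function name `arrival_sequences` are assumptions made here. It enumerates every possible task sequence together with its probability.

```python
# Transition tree: vertex -> (arriving task, [(child vertex, probability), ...])
TREE = {
    "v0": ("A", [("v1", 0.6), ("v2", 0.4)]),
    "v1": ("B", [("v3", 0.5), ("v4", 0.5)]),
    "v2": ("C", [("v5", 0.5), ("v6", 0.5)]),
    "v3": ("D", []),
    "v4": ("E", []),
    "v5": ("D", []),
    "v6": ("B", []),
}

def arrival_sequences(vertex="v0", prob=1.0, prefix=()):
    """Yield (task sequence, probability) for every root-to-leaf path."""
    task, children = TREE[vertex]
    prefix = prefix + (task,)
    if not children:
        yield prefix, prob
    for child, p in children:
        yield from arrival_sequences(child, prob * p, prefix)

sequences = dict(arrival_sequences())
for tasks, p in sequences.items():
    print(" -> ".join(tasks), p)
```

There are four possible sequences, and their probabilities sum to one.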

2.3 Formal Definition 

We give the formal definition of branching CSPs below. $\bot$, or null, is a special 
value used to represent an explicit decision not to assign a value to a variable. 
An assignment of $\bot$ to a variable will mean that any constraint on that variable 
will be satisfied by default. 

Definition 1. A binary branching CSP is a tuple $BCSP = (X, D, \delta, C, U, S, \tau)$: 

- $X$ is a finite set of variables; 

- $D$ is a finite set of values, with function $\delta : X \to \wp(D \cup \{\bot\})$ associating 
a domain of possible values to each variable $x \in X$, such that $\bot \in \delta(x)$ for 
each $x \in X$; 

- $C$ is a finite set of binary constraints, where each $c \in C$ is a triple 
$(x, y, R)$, $x, y \in X$, and $R \subseteq \delta(x) \times \delta(y)$ such that $\forall a \in \delta(x)\, \forall b \in \delta(y)$: 
$(\bot, b) \in R$ and $(a, \bot) \in R$; 

- $U : X \times (D \cup \{\bot\}) \to \mathbb{R}$ associates a utility to each value $w \in D \cup \{\bot\}$ 
assigned to a variable $x \in X$, with $U(x, \bot) = 0$ for each $x \in X$; 

- $S = (V, A, \gamma)$ is a probabilistic state transition tree with vertices $V$ and arcs 
$A$; there is a distinguished vertex $v_0 \in V$ called the root, which has no parent; 
the function $\gamma : V \times V \to [0, 1]$ is defined such that $\gamma(v, v') = 0$ if $(v, v') \notin A$, 
and if $\sigma(v) \neq \emptyset$, $\sum_{v' \in \sigma(v)} \gamma(v, v') = 1$, for each $v \in V$; $\gamma(v, v')$ represents the 
conditional probability that vertex $v'$ is the next to become active, given that 
the previous active vertex was $v$; 

- $\tau : V \to X$ is a surjective function such that for any two vertices $v, v'$ on the 
same path $p$ in $S$, $v = v'$ if $\tau(v) = \tau(v')$. $\tau$ assigns a variable to each vertex, 
ensuring that no variable appears twice on a path from root to leaf. 

The probabilistic transitions are defined in terms of the vertices of the tree, 
and not directly in terms of the variables. Each vertex represents an event, and 
multiple different events may cause the same variable to become active. The 
probability of an event depends only on its immediate predecessor, and thus the 
problem obeys the Markov property. 

Definition 2. An assignment to a BCSP is a function $\varphi : V \to D \cup \{\bot\}$ which 
assigns to each vertex either a value from the domain of its associated variable 
or the null value $\bot$. 

Definition 3. A solution to a BCSP is an assignment $\varphi$ such that if $v$ and $w$ are 
vertices on a path in $S = (V, A, \gamma)$ and $(\tau(v), \tau(w), R) \in C$, then $(\varphi(v), \varphi(w)) \in R$, 
i.e. $\varphi$ satisfies all constraints appearing on a path. 



Definition 4. The expected utility of a vertex $v$ in a solution $\varphi$ to a BCSP, 
denoted by $U_\varphi(v)$, is defined as 

$$U_\varphi(v) = U(\tau(v), \varphi(v)) + \sum_{v' \in \sigma(v)} \gamma(v, v')\, U_\varphi(v')$$

The expected utility of a solution to a BCSP is the expected utility of the root 
vertex in the solution, i.e. $U_\varphi(v_0)$. 

Note that a solution $\varphi$ is a contingent solution, specifying an assignment to 
a variable dependent on the sequence of arrivals. In fact, the assignments are 
defined in terms of events (i.e. vertices of the tree), and not directly in terms 
of the variables. Further, the solution can be executed as the problem unfolds; 
the assignments are not dependent on subsequent developments of the problem. 
Thus the solution is a policy. 

Definition 5. Let $v_i$ be a vertex at level $i$ in the tree $S = (V, A, \gamma)$, and let 
$h = \{(v_0, x_0), (v_1, x_1), \ldots, (v_{i-1}, x_{i-1})\}$ be the history of assignments made at 
vertices in the path from $v_0$ to $v_i$, with $v_{j+1} \in \sigma(v_j)$, in some solution $\varphi$. We 
say that the pair $(v_i, x_i)$ is consistent with $h$, written $(v_i, x_i) \propto h$, if it satisfies 
all constraints between $v_i$ and assignments in $h$. 

Definition 6. The maximum expected utility at a vertex $v$ given its history 
$h$ is the maximum, over consistent assignments, of the utility of an assignment 
plus the weighted sum of the maximum expected utility of the child vertices, given 
the history $h$ extended with the new assignment: 

$$U(v \mid h) = \max_{x \in \delta(\tau(v)) : (v, x) \propto h} \Big[ U(\tau(v), x) + \sum_{v' \in \sigma(v)} \gamma(v, v')\, U(v' \mid h \cup \{(v, x)\}) \Big]$$

The goal of a BCSP is to find a solution with maximal expected utility. The 
maximal expected utility is thus $U(v_0 \mid \emptyset)$. A BCSP is essentially a decision 
tree, but one which separates out the probabilities from the logical constraints on 
the decisions. It is possible to combine the constraints into the tree, but at the 
cost of a (worst-case) exponential explosion in the tree size [8]. 

Reconsider the example introduced above. Formulated as a branching CSP, it 
holds that $X = \{A, B, C, D, E\}$, $D = \{x, y, z\}$, $\delta(A) = \cdots = \delta(E) = \{x, y, z, \bot\}$, 
$U(x, \bot) = 0$ for each $x \in X$, and for each $w \in D$: $U(A, w) = 3$, $U(B, w) = 6$, 
$U(C, w) = 10$, $U(D, w) = 6$, $U(E, w) = 5$, and the constraint set $C$ consists of 
the following elements: 

- $(A, B, \{(x, \bot), (y, x), (y, \bot), (z, x), (z, \bot), (\bot, x), (\bot, \bot)\})$ 

- $(A, C, \{(x, z), (x, \bot), (y, x), (y, z), (y, \bot), (z, x), (z, \bot), (\bot, x), (\bot, z), (\bot, \bot)\})$ 

- $(A, D, \{(x, \bot), (y, x), (y, \bot), (z, x), (z, \bot), (\bot, x), (\bot, \bot)\})$ 

- $(A, E, \{(x, y), (x, \bot), (y, x), (y, \bot), (z, x), (z, y), (z, \bot), (\bot, x), (\bot, y), (\bot, \bot)\})$ 

- $(B, C, \{(x, z), (x, \bot), (\bot, x), (\bot, z), (\bot, \bot)\})$ 

- $(B, D, \{(x, \bot), (\bot, x), (\bot, \bot)\})$ 

- $(B, E, \{(x, y), (x, \bot), (\bot, x), (\bot, y), (\bot, \bot)\})$ 

- $(C, D, \{(x, \bot), (z, x), (z, \bot), (\bot, x), (\bot, \bot)\})$ 

- $(C, E, \{(x, y), (x, \bot), (z, x), (z, y), (z, \bot), (\bot, x), (\bot, y), (\bot, \bot)\})$ 

- $(D, E, \{(x, y), (x, \bot), (\bot, x), (\bot, y), (\bot, \bot)\})$ 

The probabilistic state transition tree $S = (V, A, \gamma)$ with the definition of the 
function $\tau$ is according to Fig. 1. The optimal solution is $\varphi(v_0) = y$, 
$\varphi(v_1) = x$, $\varphi(v_2) = z$, $\varphi(v_3) = \bot$, $\varphi(v_4) = \bot$, $\varphi(v_5) = x$, $\varphi(v_6) = x$, with expected 
utility $U_\varphi(v_0) = 13$. Note that the task D is given a different allocation depending 
on the arrival sequence: it is rejected if it arrives in event $v_3$ (after B in $v_1$), but it 
is allocated worker x if it arrives in event $v_5$ (after C in $v_2$). 
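
The reported value can be checked by implementing Definitions 4 and 6 directly. The sketch below is not the authors' algorithm, just a plain recursion over the example. It makes two assumptions: the ten constraint relations are encoded compactly as "the worker must be suitable for the task and no worker is used twice on a path" (which yields the same allowed pairs as the list above), and the utilities are those listed in the text, including U(E, w) = 5. All Python names are introduced here for illustration.

```python
# Transition tree: vertex -> (variable, [(child, probability), ...])
TREE = {
    "v0": ("A", [("v1", 0.6), ("v2", 0.4)]),
    "v1": ("B", [("v3", 0.5), ("v4", 0.5)]),
    "v2": ("C", [("v5", 0.5), ("v6", 0.5)]),
    "v3": ("D", []), "v4": ("E", []), "v5": ("D", []), "v6": ("B", []),
}
UTILITY = {"A": 3, "B": 6, "C": 10, "D": 6, "E": 5}
# Assumed compact encoding of the constraint set C (workers suitable per task).
SUITABLE = {"A": "xyz", "B": "x", "C": "xz", "D": "x", "E": "xy"}
NULL = None  # the null value: the task is rejected

def consistent(var, value, history):
    """(v, x) is consistent with h: suitable worker, not used earlier on the path."""
    if value is NULL:
        return True
    return value in SUITABLE[var] and all(value != used for _, used in history)

def max_expected_utility(vertex="v0", history=()):
    """Definition 6: maximum over consistent assignments at `vertex`."""
    var, children = TREE[vertex]
    best = float("-inf")
    for value in list(SUITABLE[var]) + [NULL]:
        if not consistent(var, value, history):
            continue
        gain = 0 if value is NULL else UTILITY[var]
        extended = history + ((vertex, value),)
        future = sum(p * max_expected_utility(c, extended) for c, p in children)
        best = max(best, gain + future)
    return best

def expected_utility(solution, vertex="v0"):
    """Definition 4: expected utility of a fixed assignment `solution`."""
    var, children = TREE[vertex]
    gain = 0 if solution[vertex] is NULL else UTILITY[var]
    return gain + sum(p * expected_utility(solution, c) for c, p in children)

# The solution reported in the text (the entry for v1 is inferred from the
# reported utility of 13 and is therefore an assumption).
REPORTED = {"v0": "y", "v1": "x", "v2": "z", "v3": NULL,
            "v4": NULL, "v5": "x", "v6": "x"}

print(max_expected_utility(), expected_utility(REPORTED))
```

Both calls evaluate to 13 (up to floating-point rounding): 3 for task A, plus 0.6 · 6 on the branch through B, plus 0.4 · 16 on the branch through C.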

The definition above is a slightly modified form of the one given in [7]. There it 
was assumed that the utility function $U$ did not distinguish between different 
values for a given variable (with the exception of $\bot$); i.e. $U(x, v) = U(x, v')$ for 
each $v', v \in \delta(x) \setminus \{\bot\}$. Also, in the probabilistic state transition tree, the sum 
of the transition probabilities for the children of a vertex was allowed to be less 
than 1. The missing probability represented the case where the parent event had 
no successor. In the definition given here, we could represent this by having a 
special variable whose domain is restricted to $\bot$, and ensuring any vertex which 
activates this variable has no children. 

3 Bayesian Networks and Decision Networks 

A Bayesian network is a pair $B = (G, P)$, where $G = (\mathbf{N}, \mathbf{A})$ is an acyclic 
directed graph with set of chance nodes $\mathbf{N}$, representing random variables, and 
set of arcs $\mathbf{A} \subseteq \mathbf{N} \times \mathbf{N}$, representing statistical independence relationships among 
the variables [15]. Here we assume all random variables to be discrete. A joint 
probability distribution $P$ is defined on the set of variables as follows: 

$$P(\mathbf{N}) = \prod_{X \in \mathbf{N}} P(X \mid \pi(X))$$

A Bayesian network allows for computing any a posteriori probability 
distribution of interest after entering evidence $e$ into the network. In Bayesian network 
software packages, a posteriori probability distributions are computed from the 
marginal probability distribution of an updated probability distribution $P^e$; for 
every (free) variable $A \in \mathbf{N}$, it holds that 

$$P^e(A) = P(A \mid e)$$

A decision network $\mathcal{D} = (G, P, \mathbf{N}, \mathbf{D}, \mathbf{W}, \mathbf{u})$, or influence diagram, is a Bayesian 
network with the addition of decision nodes $\mathbf{D}$ and utility nodes $\mathbf{W}$, standing 
for decision and utility variables, respectively. There is always a unique directed 
path in a decision network, on which every decision node $D$ in $\mathbf{D}$ occurs, i.e. 
decision nodes are linearly ordered. Each utility variable $W \in \mathbf{W}$ stands for a 
utility function $u_W : \delta(\mathbf{Z}) \to \mathbb{R}$, where $\delta(\mathbf{Z})$ is the Cartesian product of the 
domains of variables in $\mathbf{Z}$, and $\mathbf{Z} = \pi(W)$. The collection of utility functions is 
indicated by $\mathbf{u}$. 

Initial proposals of decision networks only included a single utility node. In 
more recent descriptions, such as in the book by Jensen [10], a decision network 
may incorporate more than one utility node $W_i$, $i = 1, \ldots, n$, and it is assumed 
that the resulting multi-attribute utility function $u_{\mathbf{W}}$ is additive, i.e. the resulting 
utility $u_{\mathbf{W}}$ is defined as follows: 

$$u_{\mathbf{W}}(\mathbf{Z}) = \sum_{i=1}^{n} u_{W_i}(\mathbf{Z}_i)$$

where $\mathbf{Z}_i = \pi(W_i)$, $\mathbf{W} = \{W_1, \ldots, W_n\}$, and $\mathbf{Z} = \bigcup_{i=1}^{n} \mathbf{Z}_i = \pi(\mathbf{W})$. Clearly, defining 
a utility function in this fashion reduces the amount of utility information that 
has to be specified; the space-complexity reduction can be as drastic as from 
exponential to linear. 
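
As a rough worked count of this reduction (an illustration under the assumption, made only here, that each utility node $W_i$ has a single parent with $d$ possible values), compare the number of utility entries needed by one joint utility function with the number needed by the additive form:

```latex
\underbrace{|\delta(\mathbf{Z})| = d^{\,n}}_{\text{single utility node}}
\qquad \text{versus} \qquad
\underbrace{\sum_{i=1}^{n} |\delta(\mathbf{Z}_i)| = n\,d}_{\text{additive utility}}
```

For $n = 10$ and $d = 4$ this is $4^{10} = 1\,048\,576$ entries against $40$.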

The aim of evaluating a decision network is to determine the optimal expected 
utility $u$ for each decision $d$ at a given decision node $D$, given the available 
evidence $e$, which includes all previously made decisions. We assume a topological 
order $\prec$ of the nodes in the network, in which we have combined consecutive 
nodes of the same type, and we place the utility nodes last. Thus we have 
$Y_0 \prec D_0 \prec Y_1 \prec D_1 \prec \cdots \prec D_{n-1} \prec Y_n \prec W$. We then define the maximum expected 
utility at a decision node $D_i$ given some evidence $e$ to be 

$$u_{D_i}(e) = \max_{d_i \in D_i} u_{Y_{i+1}}(e \cup \{D_i = d_i\})$$

and at a chance node $Y_i$ the expected utility is 

$$u_{Y_i}(e) = \sum_{y_i \in Y_i} P(Y_i = y_i \mid e)\, u_{D_i}(e \cup \{Y_i = y_i\})$$

In particular, we have the expected utility over the whole network: 

$$u_{Y_0}(\emptyset) = \sum_{y_0 \in Y_0} P(Y_0 = y_0)\, u_{D_0}(\{Y_0 = y_0\})$$

and for the terminating case we have: 

$$u_{Y_n}(e) = \sum_{y_n \in Y_n} P(Y_n = y_n \mid e)\, u_W(e \cup \{Y_n = y_n\})$$

In diagrams of Bayesian networks and decision networks, chance nodes are 
indicated by circles or ellipses, decision nodes by boxes and utility nodes by 
diamonds. 
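
The recursion above alternates between taking an expectation at chance nodes and a maximum at decision nodes. The following sketch implements it by brute force for the order $Y_0 \prec D_0 \prec \cdots \prec Y_n \prec W$. The function `evaluate`, its parameters, and the small forecast/umbrella network used to exercise it are all hypothetical and introduced here for illustration; evidence is represented simply as the tuple of values fixed so far.

```python
def evaluate(chance_domains, decision_domains, prob, utility):
    """Return u_{Y_0}(empty evidence) for the order Y_0 < D_0 < ... < Y_n < W.

    chance_domains[i]   -- values of chance node Y_i, i = 0..n
    decision_domains[i] -- values of decision node D_i, i = 0..n-1
    prob(i, y, e)       -- P(Y_i = y | e), e being the tuple of earlier values
    utility(e)          -- u_W(e) for a complete evidence tuple
    """
    n = len(chance_domains) - 1

    def u_chance(i, e):
        # Expectation over Y_i; the last chance node feeds the utility node.
        total = 0.0
        for y in chance_domains[i]:
            p = prob(i, y, e)
            if p == 0:
                continue
            nxt = e + (y,)
            total += p * (utility(nxt) if i == n else u_decision(i, nxt))
        return total

    def u_decision(i, e):
        # Maximum over the decisions available at D_i.
        return max(u_chance(i + 1, e + (d,)) for d in decision_domains[i])

    return u_chance(0, ())

# Hypothetical toy network: forecast Y_0, decision D_0, weather Y_1.
def prob(i, y, e):
    if i == 0:
        return {"cloudy": 0.4, "clear": 0.6}[y]
    rain = 0.7 if e[0] == "cloudy" else 0.1
    return rain if y == "rain" else 1 - rain

PAYOFF = {("umbrella", "rain"): 70, ("umbrella", "dry"): 20,
          ("none", "rain"): 0, ("none", "dry"): 100}

value = evaluate([["cloudy", "clear"], ["rain", "dry"]],
                 [["umbrella", "none"]],
                 prob,
                 lambda e: PAYOFF[(e[1], e[2])])
```

In this toy network the best policy takes the umbrella only after a cloudy forecast, giving 0.4 · 55 + 0.6 · 90 = 76.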



4 Relationship of Branching CSPs to Decision Networks 

4.1 Mapping Branching CSPs to Decision Networks 

Let $BCSP = (X, D, \delta, C, U, S, \tau)$. Below, we define the steps that make up the 
mapping from this representation to a decision network $\mathcal{D} = (G, P, \mathbf{N}, \mathbf{D}, \mathbf{W}, \mathbf{u})$. 

- For each set of vertices $\Lambda(n)$ at level $n \in \mathbb{N}$ of the tree $S$, there is a chance 
node $Y_n$. The domain of the associated random variable is $\delta(Y_n) = \{v \mid v \in \Lambda(n)\}$. 
The associated probability distribution $P$ is defined by: 
$P(Y_n = u \mid Y_{n-1} = v) = \gamma(v, u)$ for $n > 0$, and $P(Y_0 = v_0) = 1$. Note that for 
two vertices $v \in \Lambda(n - 1)$ and $u \in \Lambda(n)$ with $(v, u) \notin A$, we have that 
$P(Y_n = u \mid Y_{n-1} = v) = 0$, indicating that this transition cannot take place. 

- Corresponding to each random variable with domain $\delta(Y_n)$, there is a 
decision node $D_n$, with domain equal to $\delta(D_n) = \{v.x \mid v \in \delta(Y_n), x \in \delta(\tau(v))\}$. 
There exists an incoming arc to each decision node from its associated chance 
node. In addition, the decision nodes are linked in a chain in an order 
reflecting the order of their associated chance nodes. The nodes will be used 
to assign values to their associated decision variables, which corresponds to 
assigning values to variables in the BCSP. 

- For each chance node $Y_n$ there is a corresponding utility node $U_n$. The 
parents of $U_n$ are $Y_n$ and the decision node $D_n$. If $Y_n$ takes value $v$, and the 
decision node $D_n$ takes any value $v.x$, $x \neq \bot$, then the utility value is 
$U(\tau(v), x)$; otherwise it is 0. The utility nodes give their reward if a vertex 
(and hence a variable) has become active, and we have assigned a non-null 
value to that instance of the variable. 

- For each pair of chance nodes $(Y_i, Y_j)$ such that there are vertices $v \in \delta(Y_i)$ 
and $v' \in \delta(Y_j)$ with a constraint $(\tau(v), \tau(v'), R) \in C$, there exists a chance 
node $C_{ij}$ with domain $\{t, f\}$ to represent the constraints on the 
corresponding decisions. The parents of $C_{ij}$ are the decision nodes $D_i$ and $D_j$. The 
probability distribution is defined as follows: 

$$P(C_{ij} = t \mid D_i = v.x, D_j = w.y) = \begin{cases} 0 & \text{if } (\tau(v), \tau(w), R) \in C,\ (x, y) \notin R \\ 1 & \text{otherwise} \end{cases}$$

- There is one distinguished utility node $U_C$, whose parents are all the 
constraint chance nodes, with utility value 0 if all parents have value $t$, and 
utility value equal to $-M$ otherwise, where $M$ is a penalty value larger than 
the sum of all utilities in the BCSP. This node ensures that the constraints 
are satisfied. 

- Finally, for any given history in the execution of a BCSP solution, there is a 
corresponding evidence set for the network, defined by the function $\beta$ below. 
Let $H$ be the set of all possible history sets, and $E$ be the set of all possible 
evidence sets. Then 

$$\beta : H \to E : h \mapsto \{Y_i = v, D_i = v.x : (v, x) \in h, v \in \Lambda(i)\}$$

Note that there are particular features of the mapping above, which can be 
exploited to simplify the utility calculations: 

(1) Each variable $Y_j$ is conditionally independent of each variable $Y_k$, 
$k = 0, \ldots, j - 2$, and of each decision variable $D_i$ given variable $Y_{j-1}$. 

(2) The utility function $u$ defined above for the utility nodes $U_j$ is additive: 

$$u(\tau(y_0), d_0, \ldots, \tau(y_m), d_m) = \sum_{j=0}^{m} U(\tau(y_j), d_j)$$

where $y_j$ is a possible value of random variable $Y_j$ and $d_j$ is a possible value 
of decision variable $D_j$. 

(3) We can create a topological order $Y_0 \prec D_0 \prec Y_1 \prec D_1 \prec \cdots \prec D_n \prec C \prec W$, 
where $C$ represents the constraint chance nodes, and $W$ represents 
the utility nodes. The initial node $Y_0$ has domain $\{v_0\}$, so the maximum 
expected utility of the network is $u_{Y_0}(\emptyset) = u_{D_0}(Y_0 = v_0)$, and the utility 
function is $u_W(e) = \sum_{i=0}^{n} U(\tau(y_i), d_i) + u_C(e)$. 

We can now simplify the utility definitions as follows: 

$$u_{D_i}(e) = \max_{d_i \in D_i} \big[ U(\tau(y_i), d_i) + u_{Y_{i+1}}(e \cup \{D_i = d_i\}) \big]$$

$$u_{D_n}(e) = \max_{d_n \in D_n} \big[ U(\tau(y_n), d_n) + u_C(e \cup \{D_n = d_n\}) \big]$$

The $u_C$ term in the second equation is simply the maximum expected utility 
from the constraint nodes. If any of the constraints evaluate to false, then the 
utility is $-M$. Otherwise, it is 0. 

The highest expected utility at the first decision node is equal to the optimal 
expected utility of the BCSP, and the optimal decisions of the decision nodes 
correspond to the optimal plan of the BCSP. We prove this in the next section. 

Fig. 2. Decision network resulting from the mapping of the example BCSP. 

The result of mapping the example BCSP discussed in Sect. 2 is shown 
in Fig. 2. From the mapping designed above, it follows that the domain of 
the variable $Y_0$ is equal to $\{v_0\}$, for $Y_1$ it is equal to $\{v_1, v_2\}$; the domain of 
the decision variable $D_0$ is $\{v_0.x, v_0.y, v_0.z, v_0.\bot\}$, and for $D_1$ it is equal to 
$\{v_1.x, v_1.\bot, v_2.x, v_2.z, v_2.\bot\}$. 
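
The first two steps of the mapping can be sketched in a few lines. This is an illustration rather than a full construction of the decision network. The helper names are made up here, and the per-task domains are assumed to be the suitability-restricted ones from Fig. 1, which is what the domain of the second decision node listed above reflects.

```python
# Transition tree: vertex -> (variable, [(child, probability), ...])
TREE = {
    "v0": ("A", [("v1", 0.6), ("v2", 0.4)]),
    "v1": ("B", [("v3", 0.5), ("v4", 0.5)]),
    "v2": ("C", [("v5", 0.5), ("v6", 0.5)]),
    "v3": ("D", []), "v4": ("E", []), "v5": ("D", []), "v6": ("B", []),
}
# Assumed domains: the workers that Fig. 1 marks as suitable for each task.
DOMAIN = {"A": ["x", "y", "z"], "B": ["x"], "C": ["x", "z"],
          "D": ["x"], "E": ["x", "y"]}
NULL = "null"  # stands for the null value

def levels(root="v0"):
    """Group vertices by level; entry n is the domain of chance node Y_n."""
    result, frontier = [], [root]
    while frontier:
        result.append(frontier)
        frontier = [child for v in frontier for child, _ in TREE[v][1]]
    return result

def transition_table(n, by_level):
    """P(Y_n = u | Y_{n-1} = v) = gamma(v, u), and 0 where there is no arc."""
    return {(v, u): dict(TREE[v][1]).get(u, 0.0)
            for v in by_level[n - 1] for u in by_level[n]}

def decision_domain(n, by_level):
    """Values v.x of decision node D_n."""
    return [f"{v}.{x}" for v in by_level[n] for x in DOMAIN[TREE[v][0]] + [NULL]]

by_level = levels()
print(by_level[1], decision_domain(1, by_level))
```

The printed domains agree with the ones given in the text for the second chance node and the second decision node.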

4.2 Proof That the Mapping Is Correct 

We need to show that the optimal solution to the BCSP (i.e. the maximum 
expected utility at the root node) has the same value as the maximum expected 
utility of the first decision node in the network. 

We will show that the maximum expected utility from any node in the tree 
given some history is the same as the maximum expected utility from the cor- 
responding decision node in the decision network, given the corresponding evi- 
dence. 
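
Before the formal argument, the claim can be checked numerically on the running example. The sketch below evaluates the network side by brute force: decisions range over all values v.x of a level (including those for vertices that were not observed, which earn no reward), and the constraint utility node contributes $-M$ whenever two decisions for different tasks use the same worker. As assumptions made only for this illustration, the decision domains are restricted to suitable workers, the constraints are encoded as "no worker used twice", and $M$ is set to the sum of all task utilities plus one.

```python
# Transition tree: vertex -> (variable, [(child, probability), ...])
TREE = {
    "v0": ("A", [("v1", 0.6), ("v2", 0.4)]),
    "v1": ("B", [("v3", 0.5), ("v4", 0.5)]),
    "v2": ("C", [("v5", 0.5), ("v6", 0.5)]),
    "v3": ("D", []), "v4": ("E", []), "v5": ("D", []), "v6": ("B", []),
}
UTILITY = {"A": 3, "B": 6, "C": 10, "D": 6, "E": 5}
SUITABLE = {"A": "xyz", "B": "x", "C": "xz", "D": "x", "E": "xy"}
LEVELS = [["v0"], ["v1", "v2"], ["v3", "v4", "v5", "v6"]]  # domains of Y_0..Y_2
M = sum(UTILITY.values()) + 1  # assumed penalty, larger than any total reward

def gamma(v, u):
    """Transition probability P(Y_{i+1} = u | Y_i = v)."""
    return dict(TREE[v][1]).get(u, 0.0)

def decisions(i):
    """Domain of D_i: pairs (v, x) standing for v.x; x = None is the null value."""
    return [(v, x) for v in LEVELS[i] for x in list(SUITABLE[TREE[v][0]]) + [None]]

def u_c(chosen):
    """Constraint utility node: 0 if every constraint node is true, else -M."""
    for i in range(len(chosen)):
        for j in range(i + 1, len(chosen)):
            (v, x), (w, y) = chosen[i], chosen[j]
            if TREE[v][0] != TREE[w][0] and x is not None and x == y:
                return -M
    return 0

def u_decision(i, v, chosen):
    """u_{D_i}(e), with Y_i = v observed and earlier decisions `chosen`."""
    best = float("-inf")
    for w, x in decisions(i):
        # The local utility node pays out only for a non-null value at the
        # vertex that actually became active.
        reward = UTILITY[TREE[v][0]] if (w == v and x is not None) else 0
        e = chosen + ((w, x),)
        if i == len(LEVELS) - 1:
            future = u_c(e)
        else:
            future = sum(gamma(v, u) * u_decision(i + 1, u, e)
                         for u in LEVELS[i + 1])
        best = max(best, reward + future)
    return best

network_value = u_decision(0, "v0", ())
```

The result is 13 (up to rounding), the same value as the optimal expected utility reported for the example BCSP in Sect. 2.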

Theorem 1. Let $\mathcal{D} = (G, P, \mathbf{N}, \mathbf{D}, \mathbf{W}, \mathbf{u})$ be the decision network corresponding 
to the $BCSP = (X, D, \delta, C, U, S, \tau)$ obtained by the mapping defined in Sect. 4.1; 
then for each node $v_k$ at level $k$ it holds that: 

$$U(v_k \mid h) = u_{D_k}(\beta(h) \cup \{Y_k = v_k\})$$

Proof. (By backwards induction on the level of the node in the tree.) 

Basis Suppose $v$ is a vertex in $\Lambda(n)$, with $n$ the maximal level. Then $v$ must be 
a leaf vertex. It holds that 

$$U(v \mid h) = \max_{x \in \delta(\tau(v)) : (v, x) \propto h} U(\tau(v), x)$$

and $Y_n$ and $D_n$ are its corresponding chance and decision nodes. For the decision 
network, $v$ must be the value observed at chance node $Y_n$, so we have: 

$$u_{D_n}(\beta(h) \cup \{Y_n = v\}) = \max_{d_n \in D_n} \big[ U(\tau(v), d_n) + u_C(\beta(h) \cup \{Y_n = v, D_n = d_n\}) \big]$$

By the definition of the penalty value, we only need to consider those $d_n$ which 
do not violate the constraints. There will always be at least one, namely $v.\bot$, 
and so the $u_C$ term will be 0. Thus we have as required 

$$u_{D_n}(\beta(h) \cup \{Y_n = v\}) = \max_{x \in \delta(\tau(v)) : (v, x) \propto h} U(\tau(v), x)$$

Induction hypothesis Suppose that 

$$U(v_j \mid h) = u_{D_j}(\beta(h) \cup \{Y_j = v_j\})$$

holds for all vertices in the BCSP at levels $j = n, n - 1, \ldots, i + 1$. 

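Induction step Now consider a vertex $v$ in the BCSP at level $i$. If $v$ is a leaf, 
then the result is true by the basis argument. Now, suppose $v$ is not a leaf, then 
it holds that: 

$$U(v \mid h) = \max_{x \in \delta(\tau(v)) : (v, x) \propto h} \Big[ U(\tau(v), x) + \sum_{v' \in \sigma(v)} \gamma(v, v')\, U(v' \mid h \cup \{(v, x)\}) \Big]$$

but $v'$ must be a node at level $i + 1$, so by the induction hypothesis 

$$= \max_{x \in \delta(\tau(v)) : (v, x) \propto h} \Big[ U(\tau(v), x) + \sum_{v' \in \sigma(v)} \gamma(v, v')\, u_{D_{i+1}}(\beta(h) \cup \{Y_i = v, D_i = v.x, Y_{i+1} = v'\}) \Big]$$

but since all $v_{i+1} \in Y_{i+1}$ with $v_{i+1} \notin \sigma(v)$ give a zero probability, and the 
decisions in $D_i$ which give a non-negative utility are exactly those in $\delta(\tau(v))$ 
which satisfy the constraints in $h$ 

$$= \max_{d_i \in D_i} \Big[ U(\tau(v), d_i) + \sum_{v' \in Y_{i+1}} P(Y_{i+1} = v' \mid Y_i = v)\, u_{D_{i+1}}(\beta(h) \cup \{Y_i = v, D_i = d_i, Y_{i+1} = v'\}) \Big]$$

$$= \max_{d_i \in D_i} \big[ U(\tau(v), d_i) + u_{Y_{i+1}}(\beta(h) \cup \{Y_i = v, D_i = d_i\}) \big]$$

$$= u_{D_i}(\beta(h) \cup \{Y_i = v\})$$

and thus we have proved the result by induction. 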

As a corollary, we obtain that if the root node of the BCSP has an empty history, 
we can write $U(v_0) = u_{D_0}(Y_0 = v_0)$. Thus, we have proved the equivalence of 
the two representations of the problem. 

5 Future Work: Generalised Branching CSPs 

The mapping designed in Sect. 4.1 not only offers a decision-theoretic description 
of a BCSP’s components, but also indicates how these components interact. The 
decision networks that are produced are of a restricted form. Studying these 
restrictions suggests ways in which BCSPs might be generalised to handle a 
wider range of problems. 

The Bayesian network component of the resulting decision network is a linear 
chain $Y_0 \to Y_1 \to \cdots \to Y_n$ with a completely certain initial event, whereas 
general Bayesian networks are directed acyclic graphs. This restriction arises from 
the BCSP state transition tree, and the fact that the root node of the tree is the 
known arrival of the first variable. The certain initial event can easily be relaxed 
by having an empty root vertex with a number of possible children, but relax- 
ing the cause of the linear chain would require replacing the tree with a directed 
acyclic graph more similar in style to a Bayesian network. This would allow us to 
represent events which have multiple conditionally independent successors, while 
maintaining a temporal interpretation of the arcs, instead of mutually exclusive 
children as at present. Similarly, we could represent mutually independent par- 
ents of an event, instead of single parents. We would also be able to make a 
distinction between temporal (state-transition) and atemporal arcs, thus giving 
us a structure similar to a dynamic Bayesian network [5]. 

BCSPs currently assume that the arrival of variables is governed by uncer- 
tainty, but actual decisions to assign a value to a variable do not influence the 
uncertainty. By adding observation nodes to the uncertainty structure, linking 
these to explicit decision nodes, and taking these observations into account when 
assessing utilities, we could model situations where decisions may have an effect 
on the future distribution of tasks. 

If we introduce both non-temporal arcs and explicit decision nodes, then we 
can represent problems where instant decisions are not necessary. Solutions to 
the problem could wait until more evidence had been received before making a 
decision (a restricted form of this was proposed in [7]), or the solution method 
would be required to decide upon the best sequence of decisions. 

It should be noted that new algorithms would be required for the BCSP 
generalisations discussed above, and that these algorithms might be neither easy 
to develop nor efficient. Further study will be aimed at determining which of the 
generalisations still allow us to solve BCSPs in reasonable time (in the average 
case). A different approach might be to add explicit logical constraints into a 
decision network, and then attempt to produce BCSP-style algorithms for these 
extended decision networks. 

Finally, although we have presented a mapping from BCSPs to decision net- 
works, we have said nothing about the complexity of the mapping, or the ease of 
constructing representations of problems. One of the advantages of Constraint 
Programming in general is the ease of modelling, and the simplicity with which 
complex combinatorial constraints can be expressed. Similarly, constraint 
algorithms are designed to take advantage of the structure of the constraints. We need 
to establish the complexity of the transformation in Sect. 4.1, and determine 
what effect the extensional representation of the constraints has on the running 
time of the decision network algorithms. Results here would help indicate which 
of our plans for future work would be most profitable. 

6 Conclusions 

In this paper, we have given a decision-theoretic interpretation to a particular 
class of constraint-satisfaction problems with uncertainty, viz. branching con- 
straint satisfaction problems. We have done this using decision networks as a 
representation formalism to which decision-theoretic, probabilistic and logical 
constraints were mapped, giving rise to a uniform representation. The biggest 
advantage of this approach is that it allows us to study the interactions between 
the various components of a BCSP more clearly. In addition, the insight gained 
this way acted as a suitable foundation for the design of extensions to the original 
BCSP formalism. 

Abstract. The purpose of this paper is to show how probabilistic 
argumentation is applicable to modern public-key cryptography as an 
appropriate tool to evaluate webs of trust. This is an interesting application 
of uncertain reasoning that has not yet received much attention in the 
corresponding literature. 



1 Introduction 

In large open networks like the internet an increasing demand for security is 
observed. In order to establish a confidential channel between two users of the 
network, classical single-key cryptography requires them to exchange a common 
secret key over a secure channel. This may work if the network is small and local, 
but it is infeasible in non-local or large networks. To simplify the key exchange 
problem, modern public-key cryptography provides a mechanism in which the 
keys to be exchanged are not secret. In such a framework, every user owns a 
key pair consisting of a (non-secret) public key and a (secret) private key. Only 
public keys are exchanged. They are used to encrypt messages to be sent to the 
owner of the key and to verify digital signatures issued by the owner of the key. 

Before using someone else’s public key to encrypt a message or verify a signa- 
ture, one should make sure that the key really belongs to the intended recipient 
or the indicated issuer of the signature. Achieving authenticity of public keys 
can be done in several ways. The most popular approach is based on the concept 
of digital certificates. The idea is that different users of a network certify public 
keys of other network users. This leads to a certificate graph. Of course, 
certificates should only be issued if the key’s authenticity is verified. On the basis of a 
certificate graph, one can then evaluate the authenticity of the keys according to 
how much trust one assigns to the different issuers of the certificates. Because 
such an evaluation depends on trust, it is common to call such a certificate graph 
a web of trust. Section 2 gives a short introduction to public-key cryptography and 
webs of trust. For more information we refer to the literature [10,18]. 

* Research supported by (1) Alexander von Humboldt Foundation, (2) German Federal 
Ministry of Education and Research, (3) German Program for the Investment in the 
Future. 

PGP (Pretty Good Privacy) is a widely used implementation of public-key 
cryptography for email security [21]. It organizes public keys on the basis of a 
web of trust [17]. PGP’s way of evaluating the web of trust is a simple mecha- 
nism based on three pragmatic rules. Some authors have tried to formalize the 
concepts of trust and confidence more properly [1,2,3,9,12,13,19], but approaches 
to look at the problem from the perspective of the uncertain reasoning commu- 
nity are rare. One exception is the idea of applying Dempster-Shafer theory to 
distributed reputation management [20]. 

Among the various formalisms for uncertain reasoning, probabilistic argumentation 
[8] seems to be the most promising candidate. Ordinary Bayesian networks, 
for example, fail as a possible candidate because they require the underlying 
graphs to be acyclic [11] (whereas general certificate graphs are cyclic). The 
basic concepts of probabilistic argumentation are summarized in Sect. 3. As 
Sect. 4 demonstrates, by modeling trust as the probability of somebody’s relia- 
bility, translating a web of trust into a corresponding probabilistic argumentation 
system is straightforward. And it leads to a one-to-one correspondence between 
the concepts of certificate chains and arguments. Degree of support, which is 
the probability that at least one argument holds, can then be used to measure 
quantitatively the overall reliability of all possible certificate chains and to rate 
the validity of the public keys. 

The goal of this paper is twofold. First, it is supposed to increase the aware- 
ness of people interested in reasoning under uncertainty for this interesting ap- 
plication in public-key cryptography. Second, the paper intends to demonstrate 
how to use probabilistic argumentation in real world applications and to under- 
line the value of this elegant formalism. 

2 Public-Key Cryptography and the Web of Trust 

Modern cryptography consists of two major tasks: encryption and signing. To 
transmit a message m securely from sender A to recipient B, both sender and 
recipient have to be equipped with a corresponding pair of public and private 
keys. Private keys are kept secret, whereas public keys are widely available for 
any recipient. From A’s perspective, sending m over an insecure channel (e.g. the 
internet) to recipient B requires A to encrypt the message with B's public key 
and to digitally sign it with A’s own private key. On the side of recipient B, the 
message is decrypted with B's private key and the digital signature is verified 
with A’s public key. Provided that A and B have properly exchanged their 
public keys, this simple scheme realizes the main security goals (secrecy, message 
integrity, authentication, non-repudiation) for such a two-party communication. 

Public keys are usually distributed with the aid of key servers. Before sending 
encrypted messages to recipient B, A may copy B’s public key from a key 
server. On the other side, B may copy A’s public key from a key server in order 
to verify A’s digital signatures. The question is whether the keys copied from 
the key servers are really owned by B and A, respectively. A possible attacker 
O could easily generate key pairs and post the corresponding public keys in A’s 
or B’s name onto the server. Encrypting a message mistakenly with O’s public key 
enables O to decrypt and read the message (and at the same time prevents B 
from decrypting it). Similarly, verifying a digital signature with a false public 
key enables the attacker O to sign messages in the name of A. An important 
issue is thus the verification of public keys before use. One way to verify a public 
key is to compare its unique fingerprint (a hash code of fixed length) over a 
secure channel (e.g. the telephone line). This method may work in small or local 
networks with few users, but is impractical in large networks like the internet. 
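
As a rough illustration of the fingerprint idea, the following sketch hashes the raw bytes of a public key and prints the digest in short groups that two parties could compare over the telephone. The function name `fingerprint`, the sample key bytes, and the choice of SHA-256 are assumptions made for this example only; PGP defines its own fingerprint format.

```python
import hashlib


def fingerprint(public_key: bytes, group: int = 4) -> str:
    """Return a fixed-length hash of the key, grouped for reading aloud."""
    digest = hashlib.sha256(public_key).hexdigest().upper()
    return " ".join(digest[i:i + group] for i in range(0, len(digest), group))


# Hypothetical key material; a real key would be read from a key file.
key_from_server = b"-----BEGIN PUBLIC KEY----- example -----END PUBLIC KEY-----"
key_from_owner = b"-----BEGIN PUBLIC KEY----- example -----END PUBLIC KEY-----"

# The key is accepted only if both parties obtain the same fingerprint.
print(fingerprint(key_from_server) == fingerprint(key_from_owner))
```

The comparison itself still needs a secure channel, which is why this method does not scale to large networks.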

The most practical way to solve the public key exchange problem is to use 
digital certificates. A certificate can be seen as a signed public key. For example, 
to issue a certificate for A, the issuer C digitally signs A’s public key with C’s 
own private key. By doing this, C certifies that A is the true owner of the key. Of 
course, certificates should only be issued when the public key was either obtained 
or successfully verified over a secure channel. Note that certificates may consist 
of signatures from different issuers. 

If A receives B's certificate issued by C, then A has good reasons to accept 
the corresponding public key as B's public key, whenever the following three 
conditions are satisfied: 

- A fully trusts C to carefully verify public keys before issuing certificates, 

- A has received or verified C’s public key over a secure channel, 

- A has successfully verified the certificate using C’s public key. 

A collection of digital certificates is called a public-key infrastructure (PKI). In 
practice, there are two approaches to building PKIs. 

2.1 Certificate Authorities 

The first approach requires the certificates to be issued by trustworthy certificate 
authorities (CA). For example, if C is a trustworthy CA (i.e. before issuing a 
certificate, C carefully checks if the applicant is the true owner of the public key), 
then the users of a large network may exchange their public keys by exchanging 
respective certificates issued by C. Certificates issued by C can be verified using 
C’s public key. From the successful verification follows then the authenticity 
of the corresponding public key. If more than one CA issues certificates, it is 
possible that the different CAs mutually issue certificates to each other. This 
leads to undirected certificate trees which are usually organized hierarchically. 
Figure 1 shows such a tree in which network users are represented by circles and 
CAs by squares. An arrow from entity X to entity Y (users or CAs) represents 
X’s certificate issued by Y. The formal notation for such a certificate will be 
$X \Rightarrow Y$. 

If A has an authentic copy of $Aut_1$’s public key, then the authenticity of M’s 
certificate $M \Rightarrow Aut_3$ can be verified using $Aut_3$’s certificate $Aut_3 \Rightarrow Aut_2$ and 
$Aut_2$’s certificate $Aut_2 \Rightarrow Aut_1$. Such a certificate chain 
$M \Rightarrow Aut_3 \Rightarrow Aut_2 \Rightarrow Aut_1$, which links M to A, requires A to fully and 
unconditionally trust all CAs along the path 
between M and A. If any CA in the path has incorrectly issued the certificate of 
the next CA, then A can be misled regarding the authenticity of M’s certificate. 
Note that there is a unique certificate chain between any two users attached to 
such a certificate tree. 

The major advantage of such a centralized PKI is that every user is required 
to employ only one secure channel in order to get an authentic copy of its own 
CA’s public key. The major disadvantage is the requirement of unconditional 
trust in all CAs involved. 

2.2 Web of Trust 

The second approach does not require certificate authorities. The idea is that 
every user in the network can issue certificates. This leads to certificate graphs 
rather than trees. In such a decentralized context, one usually speaks about sign- 
ing public keys rather than issuing certificates. Thus, every user collects signed 
public keys from different key servers or other sources. A personal collection of 
signed public keys is called a key ring. Note that every individual key ring defines 
a corresponding certificate graph (which is a sub-graph of the complete certifi- 
cate graph of all signed public keys). Figure 2 shows the certificate graph that 
corresponds to A’s key ring. An arrow from user X to user Y means that Y 
has signed X’s public key. Question marks represent users whose public keys are 
unknown to A. A has directly signed the public keys of B, C, D, E, and F. 
This means that A has received or verified these keys over a secure channel and 
accepts them as the authentic keys of B, C, D, E, and F, respectively. In other 
words, the public keys of B, C, D, E, and F are valid for A. Many other keys in 
the graph are signed by users different from A. User G, for example, has signed 
the keys of L and M. From A’s perspective, G is called the introducer of L’s and 
M’s certificates. 

Fig. 2. Example of a certificate graph. 

In order to indirectly validate someone else’s public key, A must have full 
confidence in all introducers along the path of at least one certificate chain. This 
means that A must consider the corresponding introducers to be trustworthy in 
the sense that they only issue certificates for public keys received or verified over 
secure channels. 

Example 1: There is only one certificate chain $L \Rightarrow G \Rightarrow B \Rightarrow A$ from L to 
A. In order to validate L’s public key, A has to trust both G and B. 

Example 2: There are two certificate chains $M \Rightarrow G \Rightarrow B \Rightarrow A$ and 
$M \Rightarrow H \Rightarrow C \Rightarrow A$ from M to A. In order to validate M’s public key, A has to trust 
either G and B or H and C. 
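
The two examples can be reproduced mechanically. The following is a minimal sketch, assuming that certificates are stored as (subject, issuer) pairs and that the listed pairs are the relevant part of Fig. 2 as we read it from the knowledge base given in Sect. 4; the function name `certificate_chains` is ours.

```python
# Certificates as (subject, issuer) pairs: the issuer has signed the subject's key.
# Only the part of the graph needed for Examples 1 and 2 is listed (assumed reading).
CERTIFICATES = [
    ("B", "A"), ("C", "A"), ("G", "B"), ("H", "C"),
    ("L", "G"), ("M", "G"), ("M", "H"),
]


def certificate_chains(subject, owner, certificates):
    """Enumerate all cycle-free certificate chains from subject to owner."""
    chains = []

    def extend(path):
        current = path[-1]
        if current == owner:
            chains.append(path)
            return
        for cert_subject, issuer in certificates:
            # Follow every certificate issued for the current key, avoiding cycles.
            if cert_subject == current and issuer not in path:
                extend(path + [issuer])

    extend([subject])
    return chains


for chain in certificate_chains("M", "A", CERTIFICATES):
    introducers = chain[1:-1]   # everyone between the subject and the owner
    print(" => ".join(chain), "| introducers to trust:", introducers)
```

Each chain lists exactly the introducers that A must trust in order to validate the key at the start of the chain.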

A certificate graph in which the validity of the public keys is evaluated on 
the basis of trust is called a web of trust. Note that hierarchical certificate trees 
with fully trustworthy CAs are particular webs of trust. 

A general web of trust allows the owner of the key ring to specify gradual 
levels of trust for all individuals involved in the web. “Completely trusted” and 
“untrusted” are the two extreme cases of maximal and minimal trust, respec- 
tively. Evaluating the validity of the public keys should then lead to gradual 
levels of validity. Full validity, for example, only results from full trust along 
the path of at least one certificate chain (such as in the hierarchical case using 
CAs). The evaluation of such a general web of trust on the basis of probabilistic 
argumentation systems is the topic of this paper. 

2.3 PGP’s Web of Trust 

PGP is one of the most popular tools for public-key cryptography. The software 
can be used to encrypt and digitally sign electronic mail. It is based on a web of 
trust with some particular characteristics. First of all, PGP allows (only) three 
levels of trust: “completely trusted”, “marginally trusted”, and “untrusted”. 
Note that the owner of the key ring automatically receives full trust. In order to 
rate a public key as “valid”, PGP requires one of the following: 

a) the key belongs to the owner of the key ring, 

b) the key carries a signature from at least one¹ completely trusted introducer 
with a “valid” public key, 

c) the key carries signatures from at least two² marginally trusted introducers 
with “valid” public keys. 

¹ One is the default value, but a different (higher) value may be chosen by the user. 

² Two is the default value, but a different (higher) value may be chosen by the user. 

Otherwise, the key is rated as “invalid”.³ Note that all public keys directly 
signed by the owner of the key ring are “valid”. An example to illustrate PGP’s 
trust model is shown in Fig. 3. Gray circles stand for completely trusted, gray 
semicircles for marginally trusted, and white circles for untrusted public keys. 
“Valid” public keys are indicated by nested circles. A’s public key is “valid” 
because it is owned by A. The public keys of B, C, D, E, and F are all “valid” 
because they are directly signed by A. H and I are “valid” because C is “valid” 
and completely trusted. J is “valid” because E is “valid” and completely trusted. 
K is “valid” because J is “valid” and completely trusted. N is “valid” because 
both H and I are “valid” and marginally trusted. Finally, P is “valid” because 
J is “valid” and completely trusted. All other keys are “invalid”. 
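
The three rules can be applied repeatedly until no further key becomes “valid”. The following is a minimal sketch of such an evaluation, not PGP’s actual implementation. The signer sets are our reading of the certificate graph from the knowledge base in Sect. 4, the trust levels of C, E, J, H, and I are those stated above, and all introducers whose trust level is not mentioned in the text are assumed to be untrusted.

```python
def pgp_valid_keys(owner, signers_of, trust, complete_needed=1, marginal_needed=2):
    """Apply rules a) to c) repeatedly until no further key becomes valid."""
    valid = {owner}                                    # rule a)
    changed = True
    while changed:
        changed = False
        for key, signers in signers_of.items():
            if key in valid:
                continue
            good = [s for s in signers if s in valid]  # only valid keys count
            complete = sum(s == owner or trust.get(s) == "complete" for s in good)
            marginal = sum(trust.get(s) == "marginal" for s in good)
            if complete >= complete_needed or marginal >= marginal_needed:
                valid.add(key)                         # rule b) or rule c)
                changed = True
    return valid


# Who has signed whose key (assumed reading of the certificate graph).
SIGNERS_OF = {
    "B": {"A"}, "C": {"A"}, "D": {"A"}, "E": {"A", "F"}, "F": {"A"},
    "G": {"B"}, "H": {"C"}, "I": {"C", "D"}, "J": {"D", "E", "K"},
    "K": {"F", "J"}, "L": {"G"}, "M": {"G", "H"}, "N": {"H", "I"},
    "O": {"I"}, "P": {"J"}, "Q": {"K"},
}
# Trust levels named in the text; everyone else is assumed untrusted.
TRUST = {"C": "complete", "E": "complete", "J": "complete",
         "H": "marginal", "I": "marginal"}

print(sorted(pgp_valid_keys("A", SIGNERS_OF, TRUST)))
```

Under these assumptions the sketch rates the keys of A, B, C, D, E, F, H, I, J, K, N, and P as “valid”, in agreement with the description of Fig. 3.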

The PGP trust model is unsatisfactory in many ways. First of all, although 
trust is a gradual quantity that reflects someone’s confidence in someone else’s 
reliability, PGP provides only three levels of trust. Similarly, by simply distin- 
guishing between “valid” and “invalid” public keys, PGP is not able to gradually 
rate the authenticity of the keys. Another problem is the rule that keys signed 
by at least two marginally trusted introducers are rated as “valid”. This rule 
seems to be the product of a pragmatic way of evaluating webs of trust, but it 
is certainly not the result of a proper and well-founded trust model. In fact, one 
can easily construct counter-intuitive examples such as the ones shown in Fig. 4. 

On the left-hand side of Fig. 4, PGP’s trust model rates B’s public key 
as “valid”, whereas on the right-hand side, it is rated as “invalid”. However, 
the web of trust on the right-hand side can contain any desired number of 
certificate chains, each including one marginally trusted and 
one completely trusted introducer. One would therefore expect B’s key to 
receive a much higher degree of validity there than in the web of trust on the 
left-hand side, which has only two possible certificate chains. 



³ PGP also defines “marginally valid” public keys, but they are considered “invalid” 
by default. 

Fig. 4. More examples of PGP’s web of trust. 



3 Probabilistic Argumentation 

The basic idea behind probabilistic argumentation goes back to the concept 
of assumption-based truth maintenance systems (ATMS) [5]. The goal is 
not to describe argumentation as a dialectical process, but rather to serve as a 
deductive tool that helps to judge hypotheses in the light of the given uncertain 
and partial knowledge. Hypotheses represent open questions about the unknown 
or future world. 

From a qualitative point of view, the problem is to derive arguments in favor 
and counter-arguments against the hypothesis h of interest. An argument is 
a defeasible proof built on uncertain assumptions. In other words, arguments 
are combinations of true or false assumptions that allow the truth 
of the hypothesis h to be inferred from the given knowledge base. Every argument provides 
thus a sufficient reason that proves the hypothesis in the light of the available 
knowledge. And it finally contributes to the possibility of believing or accepting 
the hypothesis. In other words, arguments support and counter-arguments defeat 
the hypothesis h. Note that counter-arguments can be regarded as arguments in 
favor of the negated hypothesis $\neg h$ and vice versa. 

A quantitative judgement of the situation is obtained by considering the prob- 
abilities that the arguments and counter-arguments are valid. The credibility of 
a hypothesis is measured by the total probabilities that it is supported or de- 
feated by arguments. Conflicts are handled through conditioning. The resulting 
degree of support and degree of possibility correspond to belief and plausibility, 
respectively, in the Dempster-Shafer theory of evidence [14,16]. 

For the construction of probabilistic argumentation systems, consider two 
disjoint sets A and P of propositions. The elements of A are called assumptions 
and represent uncertain events, unknown circumstances, or possible states or 
outcomes. $\mathcal{L}_{A \cup P}$ denotes the propositional language over $A \cup P$. The available 
uncertain knowledge base is then encoded by a sentence $\xi \in \mathcal{L}_{A \cup P}$. $\xi$ is often 
specified by a conjunctively interpreted set $\Sigma = \{\xi_1, \ldots, \xi_r\}$ of sentences 
$\xi_i \in \mathcal{L}_{A \cup P}$, for which $\xi$ is defined by $\xi_1 \wedge \cdots \wedge \xi_r$. 

3.1 Arguments, Counter-Arguments, Conflicts 

Consider the case where a second propositional sentence $h \in \mathcal{L}_{A \cup P}$ represents 
a hypothesis about some of the propositions in $A \cup P$. What can be inferred 
from $\xi$ about the possible truth of h with respect to the given set of uncertain 
assumptions? Possibly, if some of the assumptions are set to true and others to 
false, then h may be a logical consequence of $\xi$. 

More formally, if $\mathcal{T}_A$ denotes the set of all conjunctions of non-repeating 
literals over A, then a term $\alpha \in \mathcal{T}_A$ is an argument for h if $\alpha \wedge \xi \models h$. 
Similarly, if $\alpha \wedge \xi \models \neg h$, then $\alpha$ is a counter-argument against h. Note that 
counter-arguments are arguments for $\neg h$. An argument $\alpha$ for h is called minimal 
if there is no shorter argument $\alpha' \subset \alpha$ for h. The sets of all minimal arguments and 
minimal counter-arguments with respect to h and $\xi$ are denoted by $Args(h, \xi)$ and 
$Args(\neg h, \xi)$, respectively. Note that every $\alpha \in Args(h, \xi)$ increases the support 
for h, whereas every $\alpha \in Args(\neg h, \xi)$ decreases the possibility of h. 

If a term $\alpha \in \mathcal{T}_A$ is both an argument and a counter-argument of h, then it is called 
a conflict. Conflicts are inconsistent with the knowledge base $\xi$. They represent 
impossible states of the world which have to be excluded. Note that conflicts are 
arguments for $\bot$. The set of all minimal conflicts is denoted by $Args(\bot, \xi)$. 

Consider two sets $A = \{a_1, a_2, a_3\}$, $P = \{X, Y\}$, and a knowledge base $\xi$ given 
as a set $\Sigma = \{a_1 \rightarrow X,\ \neg a_2 \rightarrow Y,\ a_3 \wedge Y \rightarrow X,\ a_2 \rightarrow \neg X\}$ of material implications. If 
X is the hypothesis of interest, then there are two arguments, one counter-argument, 
and one conflict: 

$$Args(X, \xi) = \{a_1,\ \neg a_2 \wedge a_3\}, \quad Args(\neg X, \xi) = \{a_2\}, \quad Args(\bot, \xi) = \{a_1 \wedge a_2\}.$$

Computing the sets $Args(h, \xi)$, $Args(\neg h, \xi)$, and $Args(\bot, \xi)$ is the main computational 
problem of probabilistic argumentation [8]. Efficient algorithms are 
obtained by focussing the search on the most relevant arguments [6,7]. 
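
For a knowledge base as small as the one in this example, the three sets can be found by plain enumeration. The following sketch is a brute-force illustration and not one of the efficient algorithms of [6,7,8]. It assumes that the knowledge base consists of the four implications $a_1 \rightarrow X$, $\neg a_2 \rightarrow Y$, $a_3 \wedge Y \rightarrow X$, and $a_2 \rightarrow \neg X$; all function names are ours.

```python
from itertools import product

ASSUMPTIONS = ["a1", "a2", "a3"]
PROPOSITIONS = ["X", "Y"]


def knowledge_base(v):
    """a1 -> X, not a2 -> Y, a3 and Y -> X, a2 -> not X (assumed reading)."""
    return ((not v["a1"] or v["X"])
            and (v["a2"] or v["Y"])
            and (not (v["a3"] and v["Y"]) or v["X"])
            and (not v["a2"] or not v["X"]))


def entails(term, hypothesis):
    """Check whether term AND knowledge base entail the hypothesis."""
    names = ASSUMPTIONS + PROPOSITIONS
    for values in product([False, True], repeat=len(names)):
        v = dict(zip(names, values))
        agrees = all(v[a] == value for a, value in term.items())
        if agrees and knowledge_base(v) and not hypothesis(v):
            return False        # found a counter-model
    return True


def minimal_arguments(hypothesis):
    """Return all minimal terms over the assumptions that entail the hypothesis."""
    arguments = []
    # Each assumption is absent from the term (None), positive, or negated.
    for choice in product([None, True, False], repeat=len(ASSUMPTIONS)):
        term = {a: c for a, c in zip(ASSUMPTIONS, choice) if c is not None}
        if entails(term, hypothesis):
            arguments.append(term)
    # Keep only terms that have no proper sub-term among the arguments.
    return [t for t in arguments
            if not any(s != t and s.items() <= t.items() for s in arguments)]


print(minimal_arguments(lambda v: v["X"]))       # arguments for X
print(minimal_arguments(lambda v: not v["X"]))   # counter-arguments against X
print(minimal_arguments(lambda v: False))        # conflicts (arguments for falsity)
```

A term is printed as a dictionary that maps each assumption it contains to its truth value, so `{'a2': False, 'a3': True}` stands for $\neg a_2 \wedge a_3$.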

3.2 Degrees of Support and Possibility 

In order to judge h quantitatively, let every assumption $a \in A$ be linked to a corresponding 
prior probability $p(a)$. We suppose them to be mutually independent. 
Then the probability of a term $\alpha \in \mathcal{T}_A$ is 

$$p(\alpha) = \prod \{p(a) : a \in \alpha\} \cdot \prod \{1 - p(a) : \neg a \in \alpha\}. \qquad (1)$$

If $T \subseteq \mathcal{T}_A$ is an arbitrary set of terms, then $p(T)$ denotes the overall probability 
of all terms included in T. It corresponds to the probability that at least one term 
of T is true. Note that any such set $T = \{\alpha_1, \ldots, \alpha_n\}$ can be interpreted as a 
disjunctive normal form $\alpha_1 \vee \cdots \vee \alpha_n$ (DNF for short). The problem of computing 
$p(T)$ is thus equivalent to the general problem of computing probabilities of 
DNFs. For further information on this we refer to the corresponding literature, 
in particular to Darwiche’s d-DNNF compiler [4]. 

Consider now the conditional probability that at least one argument for h 
is true under the condition that none of the conflicts of $\xi$ is true. This is a 
quantitative measure of how much h is supported by arguments in the light of 
the given knowledge. It depends on the two sets $Args(h, \xi)$ and $Args(\bot, \xi)$. If 
$Args(\bot, \xi)$ is considered as a DNF, then $\neg Args(\bot, \xi)$ represents the condition that 
conflicts are impossible. This allows us to define the degree of support of h as 

$$dsp(h, \xi) = p(Args(h, \xi) \mid \neg Args(\bot, \xi)) = \frac{p(Args(h, \xi)) - p(Args(\bot, \xi))}{1 - p(Args(\bot, \xi))}. \qquad (2)$$

For a more detailed derivation of the above formula we refer to [8]. Note that degree 
of support is equivalent to the notion of (normalized) belief in the Dempster-Shafer 
theory of evidence [14,16]. It can also be interpreted as the probability of 
provability [11,15]. 

A second way of judging the hypothesis h is to look at the conditional probability 
that no counter-argument is true under the condition that none of the 
conflicts of $\xi$ is true. This is a quantitative measure of how possible h is in the 
light of the given knowledge. Thus, the degree of possibility of h is defined as 

$$dps(h, \xi) = p(\neg Args(\neg h, \xi) \mid \neg Args(\bot, \xi)) = 1 - dsp(\neg h, \xi). \qquad (3)$$

Degree of possibility is equivalent to the notion of plausibility in the Dempster-Shafer 
theory. Note that $dsp(h, \xi) \le dps(h, \xi)$ for all $h \in \mathcal{L}_{A \cup P}$ and $\xi \in \mathcal{L}_{A \cup P}$. 
The particular case of $dsp(h, \xi) = 0$ and $dps(h, \xi) = 1$ represents total ignorance 
about h. 

Consider the example at the end of the previous subsection and suppose 
that $p(a_1) = 0.2$, $p(a_2) = 0.4$, and $p(a_3) = 0.1$ are the probabilities of the 
assumptions. The probabilities of the DNFs formed by the respective sets of 
arguments, counter-arguments, and conflicts are then as follows: 

$$p(Args(X, \xi)) = p(a_1 \vee \neg a_2 \wedge a_3) = 0.248, \quad p(Args(\neg X, \xi)) = p(a_2) = 0.4, \quad p(Args(\bot, \xi)) = p(a_1 \wedge a_2) = 0.08.$$

Finally, according to (2) and (3), degree of support and degree of possibility are 
computed as follows: 

$$dsp(X, \xi) = \frac{0.248 - 0.08}{1 - 0.08} = 0.183, \qquad dps(X, \xi) = 1 - \frac{0.4 - 0.08}{1 - 0.08} = 0.652.$$

Although there is only weak support, the hypothesis X remains quite possible. 
This is an example where gathering more information should precede any rash 
decision for or against X. 
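
These numbers can be checked by enumerating all truth assignments to the assumptions (scenarios) instead of computing DNF probabilities. The sketch below sums the probabilities of the consistent scenarios that entail the hypothesis and divides by the probability of all consistent scenarios, which is the conditional probability in (2). It is a brute-force check only, and it assumes the knowledge base $a_1 \rightarrow X$, $\neg a_2 \rightarrow Y$, $a_3 \wedge Y \rightarrow X$, $a_2 \rightarrow \neg X$; all names are ours.

```python
from itertools import product

PRIOR = {"a1": 0.2, "a2": 0.4, "a3": 0.1}   # p(a_i) as given in the example
PROPOSITIONS = ["X", "Y"]


def knowledge_base(v):
    """a1 -> X, not a2 -> Y, a3 and Y -> X, a2 -> not X (assumed reading)."""
    return ((not v["a1"] or v["X"])
            and (v["a2"] or v["Y"])
            and (not (v["a3"] and v["Y"]) or v["X"])
            and (not v["a2"] or not v["X"]))


def degree_of_support(hypothesis):
    """P(scenario is consistent and entails h) / P(scenario is consistent)."""
    supporting = conflicting = 0.0
    for flags in product([False, True], repeat=len(PRIOR)):
        scenario = dict(zip(PRIOR, flags))
        probability = 1.0
        for a, is_true in scenario.items():
            probability *= PRIOR[a] if is_true else 1.0 - PRIOR[a]
        # All completions of the scenario that satisfy the knowledge base.
        models = []
        for values in product([False, True], repeat=len(PROPOSITIONS)):
            v = {**scenario, **dict(zip(PROPOSITIONS, values))}
            if knowledge_base(v):
                models.append(v)
        if not models:
            conflicting += probability      # scenario contradicts the knowledge base
        elif all(hypothesis(v) for v in models):
            supporting += probability       # scenario entails the hypothesis
    return supporting / (1.0 - conflicting)


def degree_of_possibility(hypothesis):
    """dps(h) = 1 - dsp(not h), as in (3)."""
    return 1.0 - degree_of_support(lambda v: not hypothesis(v))


print(round(degree_of_support(lambda v: v["X"]), 3))      # 0.183
print(round(degree_of_possibility(lambda v: v["X"]), 3))  # 0.652
```

Enumeration is exponential in the number of assumptions, so it serves only as a cross-check for small examples.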



4 Trust Evaluation Based on Probabilistic Argumentation 

We will now see how to encode a web of trust as a probabilistic argumentation 
system. Because trust can be seen as someone’s confidence in someone else’s 
reliability, we denote the reliability of an introducer X by the proposition $rel(X)$. 
Gradual confidence in X can then be quantified by the probability $p(rel(X))$ of 
X being a reliable introducer. The special case where X is a fully trustworthy 
CA is encoded by $p(rel(X)) = 1$. On the other hand, $p(rel(X)) = 0$ stands for 
the case where X deserves no trust at all. 

In a similar way, we use the proposition $Val(X)$ to represent the case where 
X’s public key is valid. Note that there is usually no prior knowledge about how 
certain $Val(X)$ is. It is therefore not possible to specify corresponding 
probabilities. 

If $U = \{X_0, X_1, \ldots, X_n\}$ is the set of all users included in the key ring owned 
by $X_0$, then 

$$A = \{rel(X_1), \ldots, rel(X_n)\}, \quad P = \{rel(X_0), Val(X_0), \ldots, Val(X_n)\}, \qquad (4)$$

are the two sets of propositions needed to build a probabilistic argumentation 
system. Note that the probabilities $p(rel(X_i))$, $1 \le i \le n$, are specified by 
$X_0$, whereas $X_0$ is implicitly assumed to be fully reliable by default. Similarly, 
because $X_0$’s own public key is implicitly valid, $Val(X_0)$ is true by default. 

Finally, in order to formulate $X_0$’s certificate graph as an assumption-based 
knowledge base, consider the set $C = \{c_1, \ldots, c_m\}$ of all certificates contained 
in $X_0$’s key ring that are issued by known users. A single certificate $c \in C$ of the 
form $X_i \Rightarrow X_j$ then translates into the following propositional formula: 

$$\xi(c) = rel(X_j) \wedge Val(X_j) \rightarrow Val(X_i). \qquad (5)$$

The idea of this translation is to consider $X_i$’s public key as valid whenever 
$X_j$ is a reliable introducer with a valid public key. Note that this corresponds to 
Rule b) in PGP’s trust model. The complete knowledge base $\xi$ is now determined 
by the following set $\Sigma$ of propositional formulas: 

$$\Sigma = \{rel(X_0), Val(X_0), \xi(c_1), \ldots, \xi(c_m)\}. \qquad (6)$$



The certificate graph in Fig. 2, for example, includes 18 users $U = \{A, B, \ldots, R\}$ 
who have issued 24 certificates (6 certificates were issued by unknown users). 
This leads to the following knowledge base $\xi$ consisting of 26 formulas: 



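$$\Sigma = \left\{ \begin{array}{lll}
rel(A), & Val(A), & \\
rel(A) \wedge Val(A) \rightarrow Val(B), & rel(A) \wedge Val(A) \rightarrow Val(C), & rel(A) \wedge Val(A) \rightarrow Val(D), \\
rel(A) \wedge Val(A) \rightarrow Val(E), & rel(A) \wedge Val(A) \rightarrow Val(F), & rel(B) \wedge Val(B) \rightarrow Val(G), \\
rel(C) \wedge Val(C) \rightarrow Val(H), & rel(C) \wedge Val(C) \rightarrow Val(I), & rel(D) \wedge Val(D) \rightarrow Val(I), \\
rel(D) \wedge Val(D) \rightarrow Val(J), & rel(E) \wedge Val(E) \rightarrow Val(J), & rel(F) \wedge Val(F) \rightarrow Val(E), \\
rel(F) \wedge Val(F) \rightarrow Val(K), & rel(G) \wedge Val(G) \rightarrow Val(L), & rel(G) \wedge Val(G) \rightarrow Val(M), \\
rel(H) \wedge Val(H) \rightarrow Val(M), & rel(G) \wedge Val(H) \rightarrow Val(M), & rel(H) \wedge Val(H) \rightarrow Val(N), \\
rel(I) \wedge Val(I) \rightarrow Val(N), & rel(I) \wedge Val(I) \rightarrow Val(O), & rel(J) \wedge Val(J) \rightarrow Val(P), \\
rel(J) \wedge Val(J) \rightarrow Val(K), & rel(K) \wedge Val(K) \rightarrow Val(J), & rel(K) \wedge Val(K) \rightarrow Val(Q)
\end{array} \right\}$$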



How can such a model be used to evaluate the validity of public keys? First 
of all, if it is $X_i$’s key to be rated, then $Val(X_i)$ is the hypothesis of interest. 
Every minimal argument $\alpha \in Args(Val(X_i), \xi)$ then corresponds to a (minimal) 
certificate chain from $X_i$ to $X_0$. On the other hand, because $\xi$ contains 
only positive literals, there are no counter-arguments against $Val(X_i)$. This 
implies $Args(\neg Val(X_i), \xi) = Args(\bot, \xi) = \emptyset$ and therefore 
$dsp(Val(X_i), \xi) = p(Args(Val(X_i), \xi))$ and $dps(Val(X_i), \xi) = 1$ for all $X_i \in U$. Therefore, degree 
of support is the only relevant quantity to rate the validity of $X_i$’s public key. 
It corresponds to the probability that the introducers of at least one certificate 
chain are all reliable. 

In the example above, if it is P’s public key to be rated, there are three 
minimal arguments supporting the hypothesis $Val(P)$: 

$$Args(Val(P), \xi) = \{rel(D) \wedge rel(J),\ rel(E) \wedge rel(J),\ rel(F) \wedge rel(K) \wedge rel(J)\}.$$

The first argument $rel(D) \wedge rel(J)$, for example, corresponds to the certificate 
chain $P \Rightarrow J \Rightarrow D \Rightarrow A$. Note that non-minimal certificate chains such as 
$P \Rightarrow J \Rightarrow E \Rightarrow F \Rightarrow A$ correspond to non-minimal arguments and are thus not 
listed in the set $Args(Val(P), \xi)$. 

Suppose now that A has specified the reliability of the introducers B to R 
according to the second row of the following table. The corresponding degrees 
of support are then shown in the third row. 

X_i               A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R
p(rel(X_i))       -     .5    .9    .1    .8    .6    .9    .4    .5    .8    .2    .5    0     .1    0     .3    .1    .6
dsp(Val(X_i), ξ)  1     1     1     1     1     1     .5    .9    .91   .842  .862  .45   .648  .635  .455  .673  .172  0



A’s public key automatically receives maximal support. The keys of B, C, D, 
E, and F receive maximal support because they are directly signed by A. R’s 
public key receives no support because no certificate has been issued for R. All 
other keys are rated with values between 0 and 1. For example, consider the case 
of P’s public key. The corresponding degree of support $dsp(Val(P), \xi) = 0.673$ 
is composed of the probabilities $p(rel(D) \wedge rel(J)) = 0.08$, $p(rel(E) \wedge rel(J)) = 0.64$, 
and $p(rel(F) \wedge rel(K) \wedge rel(J)) = 0.096$ of the individual minimal certificate 
chains. Because the three chains share introducers, these probabilities are not 
simply added: $0.8 \cdot (1 - 0.9 \cdot 0.2 \cdot 0.88) = 0.673$. 
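
The value for P’s key can be reproduced with a few lines of code. The following sketch enumerates all reliability scenarios for the introducers D, E, F, J, and K, propagates validity along the certificates, and sums the probabilities of the scenarios in which the target key becomes valid. The certificate list is the part of the knowledge base that is relevant for P, the reliabilities of D, E, and F are the ones implied by the chain probabilities above, and the function names are ours.

```python
from itertools import product

# Certificates (subject, issuer) relevant for P, taken from the knowledge base.
CERTIFICATES = [
    ("D", "A"), ("E", "A"), ("F", "A"), ("J", "D"), ("J", "E"),
    ("E", "F"), ("K", "F"), ("K", "J"), ("J", "K"), ("P", "J"),
]
# p(rel(X)) for the introducers involved.
RELIABILITY = {"D": 0.1, "E": 0.8, "F": 0.6, "J": 0.8, "K": 0.2}


def valid_keys(owner, certificates, reliable):
    """Forward chaining: a key is valid if a reliable user with a valid key signed it."""
    valid = {owner}
    changed = True
    while changed:
        changed = False
        for subject, issuer in certificates:
            if issuer in valid and issuer in reliable and subject not in valid:
                valid.add(subject)
                changed = True
    return valid


def dsp_val(target, owner, certificates, reliability):
    """Degree of support of Val(target): total probability of all supporting scenarios."""
    users = list(reliability)
    total = 0.0
    for flags in product([False, True], repeat=len(users)):
        reliable = {owner} | {u for u, flag in zip(users, flags) if flag}
        probability = 1.0
        for u, flag in zip(users, flags):
            probability *= reliability[u] if flag else 1.0 - reliability[u]
        if target in valid_keys(owner, certificates, reliable):
            total += probability
    return total


for key in ("J", "K", "P"):
    print(key, round(dsp_val(key, "A", CERTIFICATES, RELIABILITY), 3))
```

Since the knowledge base produces no conflicts, no normalization is needed, and the cyclic certificates between J and K cause no difficulty because validity is computed as a fixed point.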



5 Conclusion 

This paper investigates trust evaluation in public-key infrastructures based on 
probabilistic argumentation. It is remarkable how straightforwardly certificate 
graphs are expressible as assumption-based knowledge bases. Degree of support 
seems then to be an appropriate quantity to rate the validity of public keys. We 
propose the results of this paper to be taken as the basis for a more sophisticated 
trust model in cryptographic applications like PGP. 

Future work will focus on how to include negative evidence in the form of 
key revocations, recommendations about someone else’s trustworthiness, and 
dependencies between the introducers. Due to the clear conflict management 
of probabilistic argumentation and the expressive power of assumption-based 
modeling, it should not be too hard to extend the basic model accordingly. 

Abstract. Recently, we proposed a new method called the plausibility 
transformation method to convert a belief function model to an 
equivalent probability model. In this paper, we compare the plausibility 
transformation method with the pignistic transformation method. 
The two transformation methods yield qualitatively different probability 
models. We argue that the plausibility transformation method is the 
correct method for translating a belief function model to an equivalent 
probability model that maintains belief function semantics. 



1 Introduction 

Bayesian probability theory and the Dempster-Shafer (D-S) theory of belief func- 
tions are two distinct calculi for modeling and reasoning with knowledge about 
propositions in uncertain domains. In a recent paper [1], we have argued that 
these two calculi have roughly the same expressive power. Also, in [2,3], we 
have proposed a new method, called the plausibility transformation method, for 
transforming a belief function model to an equivalent probability model. 

In this paper, we compare two techniques, the pignistic transformation [11] 
and the plausibility transformation, for transforming a belief function model to 
a Bayesian probability model. In many cases, these two methods lead to radically 
different probability models starting from the same belief function model. We 
argue that the plausibility transformation method is the correct method and 
that it provides an equivalent probability model that is consistent with belief 
function semantics. 

There are many different semantics of belief functions, including multivalued 
mapping [5], random codes [9], transferable beliefs [11], and hints [7], that are 
compatible with Dempster’s rule of combination. However, the semantics of belief 
functions as upper and lower probability bounds on some true but unknown 
probability function are incompatible with Dempster’s rule [12]. In this paper, 
we are concerned with the D-S theory of belief functions with Dempster’s rule 
of combination as the updating rule, and not with theories of upper and lower 
probabilities that admit various other rules for updating beliefs. One benefit of 
studying probability functions derived from D-S belief functions is a clearer 
understanding of D-S belief function semantics. 

The remainder of this paper is organized as follows. Section 2 contains notation 
and definitions associated with probability theory and the Dempster-Shafer 
theory of belief functions. Section 3 defines the pignistic and plausibility trans- 
formation methods. Section 4 describes three examples that are studied in great 
detail. Section 5 contains four theorems that define the properties of the plau- 
sibility transformation. In Sect. 6, we summarize and conclude. Proofs of all 
theorems can be found in [2]. This paper is extracted from a larger unpublished 
working paper [2]. 

2 Notation and Definitions 

This section establishes notation and definitions that will be used throughout 
the paper. 



2.1 Probability Theory 

A probability potential $P_s$ for s is a function $P_s : \Omega_s \rightarrow [0, 1]$. We express 
our knowledge by probability potentials, which are combined to form the joint 
probability distribution, which is then marginalized to the relevant variables. 



Projection of States. If $(w, x, y, z)$ is a state of $\{W, X, Y, Z\}$, for example, 
then the projection of $(w, x, y, z)$ to $\{W, X\}$ is simply $(w, x)$, which is a state of 
$\{W, X\}$. If s and t are sets of variables, $s \subseteq t$, and x is a state of t, then $x^{\downarrow s}$ 
denotes the projection of x to s. 



Combination. Combination in probability theory is “pointwise” multiplication 
of potentials followed by normalization. Suppose $P_s$ is a probability potential for 
s and $P_t$ is a probability potential for t. Then $P_s \otimes P_t$ is a probability potential 
for $s \cup t$ defined as follows: 

$$(P_s \otimes P_t)(x) = K^{-1} P_s(x^{\downarrow s}) P_t(x^{\downarrow t}), \qquad (1)$$

for each $x \in \Omega_{s \cup t}$, where $K = \sum \{P_s(x^{\downarrow s}) P_t(x^{\downarrow t}) \mid x \in \Omega_{s \cup t}\}$ is the 
normalization constant. 
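
As a small illustration of (1), the sketch below combines a potential for {X} with a potential for {X, Y}. The representation of a potential as a dictionary from state tuples to numbers, the variable domains, and the numerical values are assumptions made for this example.

```python
from itertools import product


def combine(p_s, vars_s, p_t, vars_t, domains):
    """Pointwise multiplication of two potentials followed by normalization."""
    vars_u = list(dict.fromkeys(vars_s + vars_t))     # s united with t, order kept
    raw = {}
    for state in product(*(domains[v] for v in vars_u)):
        x = dict(zip(vars_u, state))
        proj_s = tuple(x[v] for v in vars_s)          # projection of x to s
        proj_t = tuple(x[v] for v in vars_t)          # projection of x to t
        raw[state] = p_s[proj_s] * p_t[proj_t]
    k = sum(raw.values())                             # normalization constant K
    return vars_u, {state: value / k for state, value in raw.items()}


# Hypothetical domains and potentials.
DOMAINS = {"X": ["x", "nx"], "Y": ["y", "ny"]}
P_X = {("x",): 0.3, ("nx",): 0.7}
P_XY = {("x", "y"): 0.9, ("x", "ny"): 0.1, ("nx", "y"): 0.2, ("nx", "ny"): 0.8}

variables, joint = combine(P_X, ["X"], P_XY, ["X", "Y"], DOMAINS)
print(variables, joint)
```

In this particular example the products already sum to one, so the normalization step leaves the values unchanged.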



Marginalization. Marginalization in probability theory involves addition over 
the state space of the variables being eliminated. Suppose $P_s$ is a probability 
potential for s, and suppose $X \in s$. The marginal of $P_s$ for $s \setminus \{X\}$, denoted by 
$P_s^{\downarrow (s \setminus \{X\})}$, is the probability potential for $s \setminus \{X\}$ defined as follows: 

$$P_s^{\downarrow (s \setminus \{X\})}(y) = \sum \{P_s(y, x) \mid x \in \Omega_X\}, \qquad (2)$$

for all $y \in \Omega_{s \setminus \{X\}}$. 

2.2 Dempster-Shafer Theory of Belief Functions 

A Dempster-Shafer basic probability assignment (bpa) assigns values to subsets 
of the state space. If $\Omega_s$ is the state space of a set of variables s, a function 
$m : 2^{\Omega_s} \rightarrow [0, 1]$ is a bpa for s whenever $m(\emptyset) = 0$ and 

$$\sum \{m(a) \mid a \in 2^{\Omega_s}\} = 1. \qquad (3)$$

A bpa can also be stated in terms of a corresponding plausibility function. 
The plausibility function corresponding to a bpa m for s is defined as 
$Pl_m : 2^{\Omega_s} \rightarrow [0, 1]$ such that for all $a \in 2^{\Omega_s}$, 

$$Pl_m(a) = \sum \{m(b) \mid b \cap a \neq \emptyset\}. \qquad (4)$$



Projection and Extension of Subsets. If $r$ and $s$ are sets of variables, $r \subseteq s$,
and $a$ is a nonempty subset of $\Omega_s$, then the projection of $a$ to $r$, denoted by $a^{\downarrow r}$,
is the subset of $\Omega_r$ given by $a^{\downarrow r} = \{x^{\downarrow r} \mid x \in a\}$.

By extension of a subset of a state space to a subset of a larger state space,
we mean a cylinder set extension. If $r$ and $s$ are sets of variables, $r \subseteq s$, and $a$ is a
subset of $\Omega_r$, then the extension of $a$ to $s$ is $a \times \Omega_{s \setminus r}$. Let $a^{\uparrow s}$ denote the extension
of $a$ to $s$. For example, if $a$ is a subset of $\Omega_{\{W, X\}}$, then $a^{\uparrow \{W, X, Y, Z\}} = a \times \Omega_{\{Y, Z\}}$.



Combination. Calculation of a joint bpa is accomplished by using Dempster’s
rule of combination [5]. Consider two bpa’s $m_A$ and $m_B$ for $a$ and $b$, respectively.
The combination of $m_A$ and $m_B$, denoted by $m_A \oplus m_B$, is the bpa for $a \cup b$ given
by

$(m_A \oplus m_B)(c) = K^{-1} \sum \{m_A(x) m_B(y) \mid (x^{\uparrow (a \cup b)}) \cap (y^{\uparrow (a \cup b)}) = c\}$ (5)

for all nonempty $c \subseteq \Omega_{a \cup b}$, where $K$ is a normalization constant given by
$K = \sum \{m_A(x) m_B(y) \mid (x^{\uparrow (a \cup b)}) \cap (y^{\uparrow (a \cup b)}) \neq \emptyset\}$.



Marginalization. Suppose $m$ is a bpa for $s$, and suppose $t \subseteq s$. The marginal
of $m$ for $t$, denoted $m^{\downarrow t}$, is the bpa for $t$ defined as follows:

$m^{\downarrow t}(a) = \sum \{m(b) \mid b^{\downarrow t} = a\}$ (6)

for each $a \subseteq \Omega_t$.
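
A minimal Python sketch of Eqs. (4) and (5) follows, under the simplifying assumption that both bpa's are defined on the same frame, so that no cylinder extension is needed. The representation (focal sets as `frozenset` keys), the function names, and the two-state example are hypothetical and serve only as illustration.

```python
def dempster(m_a, m_b):
    """Eq. (5) for two bpa's on the same frame (no extension needed)."""
    raw = {}
    for x, mass_x in m_a.items():
        for y, mass_y in m_b.items():
            c = x & y
            if c:  # mass falling on the empty intersection is conflict and is dropped
                raw[c] = raw.get(c, 0.0) + mass_x * mass_y
    k = sum(raw.values())  # normalization constant K
    if k == 0:
        raise ValueError("totally conflicting evidence")
    return {c: v / k for c, v in raw.items()}


def plausibility(m, a):
    """Eq. (4): total mass of the focal sets that intersect a."""
    return sum(v for b, v in m.items() if b & a)


# Hypothetical two-state frame {u, v}, used only for illustration.
U, V = frozenset({"u"}), frozenset({"v"})
m_a = {U: 0.9, U | V: 0.1}
m_b = {V: 0.9, U | V: 0.1}
m_ab = dempster(m_a, m_b)
```

In this example the conflicting product 0.9 x 0.9 is discarded, and the remaining mass 0.19 is rescaled to one.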

3 Transformation Methods 

In this section, we define the pignistic transformation and the plausibility trans- 
formation methods for converting belief functions to probability functions. 
3.1 Pignistic Transformation

Suppose $m$ is a bpa for $s$. Let $BetP_m$ denote the corresponding probability
function obtained using the pignistic transformation method [10,11]. $BetP_m$ is
defined as follows:

$BetP_m(x) = \sum \{ m(a) / |a| \mid a \in 2^{\Omega_s},\ x \in a \}$ (7)

for each $x \in \Omega_s$. To simplify terminology, we will refer to $BetP_m$ as the pignistic
probability function (corresponding to bpa $m$).

3.2 Plausibility Transformation 

Suppose $m$ is a bpa for $s$. Let $Pl_m$ denote the plausibility function for $s$
corresponding to bpa $m$. Let $Pl\_P_m$ denote the probability function for $s$
corresponding to $m$ obtained using the plausibility transformation method. $Pl\_P_m$ is
defined as follows:

$Pl\_P_m(x) = K^{-1} Pl_m(\{x\})$ (8)

for all $x \in \Omega_s$, where $K = \sum \{Pl_m(\{x\}) \mid x \in \Omega_s\}$ is the normalization constant.
To simplify terminology, we will refer to $Pl\_P_m$ as the plausibility probability
function (corresponding to bpa $m$).
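
The two transformations in Eqs. (7) and (8) can be sketched in a few lines of Python. The function names `bet_p` and `pl_p`, the dictionary representation, and the three-state bpa are assumptions made for this illustration only.

```python
def bet_p(m, frame):
    """Eq. (7): each focal set shares its mass equally among its states."""
    return {x: sum(v / len(a) for a, v in m.items() if x in a) for x in frame}


def pl_p(m, frame):
    """Eq. (8): singleton plausibilities rescaled so that they sum to one."""
    pl = {x: sum(v for a, v in m.items() if x in a) for x in frame}
    k = sum(pl.values())  # normalization constant K
    return {x: p / k for x, p in pl.items()}


# Hypothetical bpa on a three-state frame, used only for illustration.
frame = ("a", "b", "c")
m = {frozenset("a"): 0.6, frozenset("ab"): 0.3, frozenset("abc"): 0.1}
bet = bet_p(m, frame)
plp = pl_p(m, frame)
```

Both results are probability functions on the frame; they differ in how the mass of the non-singleton focal sets is treated.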



4 Three Examples 

The examples in this section will highlight the differences between the pignistic 
and plausibility transformation methods. 

4.1 Example 1: Peter, Paul, and Mary [11] 

A mafia don, the Godfather, has three assassins, Peter, Paul, and Mary. Needing 
to assassinate an informant, Mr. Jones, the Godfather decides to first toss a fair 
coin to decide the sex of the assassin. If the toss results in heads, he will pick 
Mary for the job. If the toss results in tails, he will ask either Peter or Paul to 
do the job. In the case of tails, we have no knowledge of how the Godfather will 
select between Peter and Paul. Now suppose we find Mr. Jones assassinated. 
An informant in the mafia organization has informed the district attorney (DA) 
about the Godfather’s incomplete mechanism for choosing among Peter, Paul, 
and Mary. The DA would like to indict Peter, Paul, or Mary (in addition to the 
Godfather). Who should the DA indict? 

Let $A$ denote the assassin variable with three states: Peter, Paul, and Mary.
Given our knowledge of the incomplete protocol of how the assassin was
selected, we can represent it by the bpa $m_1$ for $A$ as follows: $m_1(\{Mary\}) = 0.5$,
$m_1(\{Peter, Paul\}) = 0.5$. The pignistic probability function corresponding to
$m_1$ is as follows: $BetP_{m_1}(Mary) = 0.5$, $BetP_{m_1}(Peter) = BetP_{m_1}(Paul) = 0.25$.
The plausibility probability function corresponding to $m_1$ is as follows:
$Pl\_P_{m_1}(Mary) = Pl\_P_{m_1}(Peter) = Pl\_P_{m_1}(Paul) = 1/3$. The pignistic
transformation completes the Godfather’s incomplete selection protocol by dividing
the 0.5 probability equally between Peter and Paul. We refer to this assignment
of equal probabilities as a random choice protocol. The plausibility
transformation makes no assumption about the mechanism that will be used. The mafia
don may always prefer Peter to Paul, or perhaps Paul to Peter. Using standard
belief function semantics, there is a 0.5 chance that Mary is not the assassin, a
0.5 chance that Peter is not the assassin, and a 0.5 chance that Paul is not the
assassin. This explains the plausibility probability function $Pl\_P_{m_1}$.

Clearly, the two transformation methods yield qualitatively different results 
starting from the same bpa $m_1$. Which probability distribution can be considered
as equivalent to $m_1$? In the following paragraphs, we describe one argument
(flawed, in our opinion) in favor of the pignistic transformation method and two 
arguments (compelling, in our opinion) in favor of the plausibility transformation 
method. 

Consider the following argument in favor of the pignistic transformation 
method^. 

There is exactly one “argument” for Mary and one “counter-argument” 
each for Mary, Peter and Paul, respectively, as follows [6]: 





         Arguments   Counter-arguments   Bel   Pl
Mary     Heads       Tails               0.5   0.5
Peter    --          Heads               0     0.5
Paul     --          Heads               0     0.5



A transformation method should take both arguments and counter- 
arguments into account. The pignistic transformation method considers 
both in this example by averaging the weights of arguments and counter- 
arguments. On the other hand, the plausibility transformation method 
takes only counter-arguments into account (ignoring arguments). 

What this argument fails to notice is that the counter-arguments for Peter and
Paul are exactly the same as the argument for Mary. Thus, in averaging the
weights of arguments and counter-arguments, we are selectively double-counting
information, violating a fundamental tenet of uncertain reasoning. A belief
function has exactly the same information as the corresponding plausibility function,
$Pl(a) = 1 - Bel(\Omega_A \setminus a)$. By ignoring arguments, the plausibility transformation
method avoids double counting uncertain information.

One way to resolve the conflict between $BetP$ and $Pl\_P$ is to appeal to the
property of idempotency. Suppose we have two pieces of identical, independent
evidence about the assassin, both equal to the bpa $m_1$. If we use Dempster’s rule
to combine these two pieces of evidence, we observe that $m_1 \oplus m_1 = m_1$, i.e.,
$m_1$ is idempotent. $Pl\_P_{m_1}$ is also idempotent, i.e., $Pl\_P_{m_1} \otimes Pl\_P_{m_1} = Pl\_P_{m_1}$.
However, notice that $BetP_{m_1}$ is not idempotent. Denoting $BetP_{m_1} \otimes BetP_{m_1}$ by
$BetP_m$, we have $BetP_m(Mary) = 2/3$ and $BetP_m(Peter) = BetP_m(Paul) = 1/6$.
Idempotency is an important qualitative property of uncertain knowledge
because double-counting of idempotent information is harmless.

^ This argument was provided by Rolf Haenni [private communication].
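
The idempotency claims above can be checked with exact rational arithmetic. The following Python sketch is an illustration only; the helper names and the use of `fractions.Fraction` are assumptions of this sketch, not part of the paper.

```python
from fractions import Fraction as F

FRAME = ("Mary", "Peter", "Paul")
M1 = {frozenset({"Mary"}): F(1, 2), frozenset({"Peter", "Paul"}): F(1, 2)}


def dempster(m_a, m_b):
    """Dempster's rule for two bpa's on the same frame."""
    raw = {}
    for x, mass_x in m_a.items():
        for y, mass_y in m_b.items():
            if x & y:
                raw[x & y] = raw.get(x & y, 0) + mass_x * mass_y
    k = sum(raw.values())
    return {c: v / k for c, v in raw.items()}


def normalize(p):
    k = sum(p.values())
    return {x: v / k for x, v in p.items()}


def bet_p(m):
    """Pignistic transformation."""
    return {x: sum(v / len(a) for a, v in m.items() if x in a) for x in FRAME}


def pl_p(m):
    """Plausibility transformation."""
    return normalize({x: sum(v for a, v in m.items() if x in a) for x in FRAME})


def bayes(p, q):
    """Pointwise multiplication followed by normalization."""
    return normalize({x: p[x] * q[x] for x in FRAME})


bet, plp = bet_p(M1), pl_p(M1)
m1_twice = dempster(M1, M1)    # equals M1
plp_twice = bayes(plp, plp)    # equals plp
bet_twice = bayes(bet, bet)    # differs from bet
```

With these definitions, `bet_twice` assigns 2/3 to Mary and 1/6 each to Peter and Paul, as stated in the text.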

Continuing the Peter, Paul or Mary saga, suppose we subsequently learn
that Peter has a cast-iron alibi during the time Mr. Jones was assassinated.
This piece of evidence can be represented by the bpa $m_2$ for $A$ as follows:
$m_2(\{Paul, Mary\}) = 1$. If we combine the two independent bpa’s $m_1$ and $m_2$,
we get $(m_1 \oplus m_2)(\{Paul\}) = (m_1 \oplus m_2)(\{Mary\}) = 0.5$. Since the joint bpa has
only singleton focal subsets, both the pignistic and plausibility probability
functions corresponding to $m_1 \oplus m_2$ agree: $BetP_{m_1 \oplus m_2}(Paul) = Pl\_P_{m_1 \oplus m_2}(Paul) =
BetP_{m_1 \oplus m_2}(Mary) = Pl\_P_{m_1 \oplus m_2}(Mary) = 0.5$. However, if we were using the
pignistic probability distribution $BetP_{m_1}$ and updated this probability
distribution (using Bayes rule) with the evidence of Peter’s alibi (represented with a
likelihood vector that has 0 for Peter and 1’s for Paul and Mary), we would end with
a probability distribution for $A$ that has probability 2/3 for Mary and 1/3 for
Paul, a result that does not coincide with $BetP_{m_1 \oplus m_2}$. On the other hand, if
we were using the plausibility probability distribution $Pl\_P_{m_1}$ and updated
this distribution with the evidence of Peter’s alibi, the result would be a probability
distribution for $A$ that has probability 1/2 for Paul and 1/2 for Mary, exactly
the same probability distribution as $Pl\_P_{m_1 \oplus m_2}$.
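
The two Bayesian updates described above can be reproduced with a short Python sketch. The function name `bayes_update` and the dictionary encoding of the likelihood vector are assumptions of this illustration; the prior values are the ones given in the text.

```python
from fractions import Fraction as F


def bayes_update(prior, likelihood):
    """Multiply the prior by the likelihood vector and renormalize."""
    raw = {x: prior[x] * likelihood[x] for x in prior}
    k = sum(raw.values())
    return {x: v / k for x, v in raw.items()}


bet_m1 = {"Mary": F(1, 2), "Peter": F(1, 4), "Paul": F(1, 4)}
plp_m1 = {"Mary": F(1, 3), "Peter": F(1, 3), "Paul": F(1, 3)}
alibi = {"Mary": 1, "Peter": 0, "Paul": 1}  # likelihood vector for Peter's alibi

bet_after = bayes_update(bet_m1, alibi)
plp_after = bayes_update(plp_m1, alibi)
```

Only `plp_after` coincides with the probabilities 1/2 and 1/2 obtained from the joint bpa.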



4.2 Example 2: Counter-Example [10] 

Consider a bpa $m$ for a variable $H$ with state space $\Omega_H = \{h_1, \ldots, h_{70}\}$ as
follows: $m(\{h_1\}) = 0.30$, $m(\{h_2\}) = 0.01$, $m(\{h_2, h_3, \ldots, h_{70}\}) = 0.69$. For this
bpa $m$, the pignistic probability function $BetP_m$ is as follows: $BetP_m(h_1) = 0.30$,
$BetP_m(h_2) = 0.02$, $BetP_m(h_3) = \ldots = BetP_m(h_{70}) = 0.01$. The un-normalized
plausibility probability function $Pl\_P'_m$ is as follows: $Pl\_P'_m(h_1) = 0.30$,
$Pl\_P'_m(h_2) = 0.70$, $Pl\_P'_m(h_3) = \ldots = Pl\_P'_m(h_{70}) = 0.69$.

Clearly, the two probability functions are very different. The pignistic
probability function has $h_1$ 15 times more likely than $h_2$ whereas the plausibility
probability function has $h_2$ 2.33 times more likely than $h_1$. Our interpretation is
that the pignistic transformation uses a random protocol where the probability
of 0.69 is divided equally amongst the 69 states $h_2, \ldots, h_{70}$. Smets [10] argues
that the originality of Shafer’s model is that, unlike probabilistic models, it
does not resort to an argument of symmetry to arbitrarily split belief assigned
to non-singleton subsets into equal parts; however, we interpret the pignistic
transformation as performing this very allocation.

Shafer [8] states that $m(A)$ should be interpreted as the probability mass
that is “confined to $A$ but can move freely to every point of $A$” (p. 40). In this
example, we have a belief of 0.70 against $h_1$, a belief of 0.30 against $h_2$, and a
belief of 0.31 against $h_3, \ldots, h_{70}$. Rather than use a random choice protocol, the
plausibility transformation assumes that all mass can move freely to any state in
the focal element of the belief function, which is consistent with belief function
semantics.

Another compelling argument for the plausibility transformation method is
as follows. Consider a hypothetical situation where we have $n$ independent
pieces of evidence, all exactly equal to $m$. Combining these $n$ pieces of evidence
by Dempster’s rule yields $m^n$. For $n > 500$, we observe that $m^n(\{h_2\}) \approx 1$, so
the result is more consistent with $Pl\_P_m$ (that has $h_2$ as the most probable state)
than with $BetP_m$ (that has $h_1$ as the most probable state). Notice that if we
combine $Pl\_P_m$ $n$ times using Bayes rule (or pointwise multiplication) and denote
the result by $(Pl\_P_m)^n$, for large $n$ we get the result that $(Pl\_P_m)^n(h_2) \approx 1$.
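
The behaviour of $m^n$ in this example can be checked numerically. The sketch below repeats Dempster's rule on the bpa of Example 2; encoding the states $h_1, \ldots, h_{70}$ as the integers 1 to 70 and the helper names are assumptions made for the illustration.

```python
H1 = frozenset({1})
H2 = frozenset({2})
REST = frozenset(range(2, 71))  # the focal set {h2, ..., h70}
m = {H1: 0.30, H2: 0.01, REST: 0.69}


def dempster(m_a, m_b):
    """Dempster's rule for two bpa's on the same frame."""
    raw = {}
    for x, mass_x in m_a.items():
        for y, mass_y in m_b.items():
            if x & y:
                raw[x & y] = raw.get(x & y, 0.0) + mass_x * mass_y
    k = sum(raw.values())
    return {c: v / k for c, v in raw.items()}


def power(bpa, n):
    """Combine n copies of the same bpa (n >= 1)."""
    result = bpa
    for _ in range(n - 1):
        result = dempster(result, bpa)
    return result


states = range(1, 71)
bet = {h: sum(v / len(a) for a, v in m.items() if h in a) for h in states}
pl = {h: sum(v for a, v in m.items() if h in a) for h in states}  # un-normalized

m_500 = power(m, 500)
most_probable_bet = max(bet, key=bet.get)  # state 1, i.e. h1
most_probable_pl = max(pl, key=pl.get)     # state 2, i.e. h2
```

After 500 combinations nearly all mass sits on $\{h_2\}$, which is the state ranked first by the plausibility values and not the one ranked first by the pignistic values.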



4.3 Example 3: Target Identification Problem [4] 

A target identification system is composed of 30 sensors, $S_i$, $i = 1, \ldots, 30$. Each
sensor $S_i$ is in one of two states $x_i$ or $y_i$. The state of the sensors depends on an
unknown target that is assumed to be in one of two states: $t_1$ denoting friend, or
$t_2$ denoting foe. The state of each sensor also depends on whether it is working
or not. When in working condition, a sensor reading of $x_i$ correctly identifies
a target of type $t_1$ and a sensor reading of $y_i$ correctly identifies a target of
type $t_2$. When the sensors are not in working condition, nothing is known about
what the sensor readings mean. The first 11 sensors $S_1, \ldots, S_{11}$ are high quality
sensors, and the remaining 19 sensors $S_{12}, \ldots, S_{30}$ are low quality sensors. A high
quality sensor has a 99% probability of being in working condition whereas a low
quality sensor has only a 90% probability of being in working condition. Data in
the form of sensor readings is collected as follows: $x_1, \ldots, x_{10}, y_{11}, x_{12}, y_{13}, \ldots, y_{30}$.
What conclusions can we draw about the actual target type?

First, we will represent the evidence from the 30 sensors by bpa’s and compute 
the joint belief function for T. Subsequently, we will represent the evidence 
by probability functions using the pignistic transformation and the plausibility 
transformation, in each case computing the joint probability function for T. 

Table 1 shows the data collected from the sensors represented as evidence in 
bpa’s. We can reach a conclusion about the target identity by calculating the 
joint bpa for the 30 sensors. Using Dempster’s rule, the joint bpa m is given by 
$m = m_1 \oplus \ldots \oplus m_{30}$. The results are presented in Table 2. Thus, as per the belief
function model, the target is approximately 10 times more likely to be a friend 
than a foe. 



Table 1. Bpa encoding of sensor readings

Sensor $S_i = x_i$             Sensor $S_{11} = y_{11}$       Sensor $S_{12} = x_{12}$       Sensor $S_i = y_i$
$i = 1, \ldots, 10$                                                                          $i = 13, \ldots, 30$
$a \in 2^{\Omega_T}$  $m_i(a)$    $a \in 2^{\Omega_T}$  $m_{11}(a)$   $a \in 2^{\Omega_T}$  $m_{12}(a)$   $a \in 2^{\Omega_T}$  $m_i(a)$
$\{t_1\}$           0.99       $\{t_2\}$           0.99       $\{t_1\}$           0.90       $\{t_2\}$           0.90
$\{t_1, t_2\}$      0.01       $\{t_1, t_2\}$      0.01       $\{t_1, t_2\}$      0.10       $\{t_1, t_2\}$      0.10
Table 2. Joint Bpa and Plausibility Functions for 30 Sensors

$a \in 2^{\Omega_T}$   Un-normalized bpa                 Normalized bpa ($m$)   Plausibility ($Pl_m$)
$\emptyset$          $\approx 1$                        0                      0
$\{t_1\}$            $\approx 1.00 \times 10^{-20}$     0.9091                 0.9091
$\{t_2\}$            $\approx 1.00 \times 10^{-21}$     0.0909                 0.0909
$\{t_1, t_2\}$       $\approx 1.00 \times 10^{-41}$     0.0000                 1



Table 3. Pignistic Probability Function Encoding of Sensor Readings

Sensor $S_i = x_i$               Sensor $S_{11} = y_{11}$            Sensor $S_{12} = x_{12}$            Sensor $S_i = y_i$
$i = 1, \ldots, 10$                                                                                    $i = 13, \ldots, 30$
$x \in \Omega_T$  $BetP_{m_i}(x)$   $x \in \Omega_T$  $BetP_{m_{11}}(x)$   $x \in \Omega_T$  $BetP_{m_{12}}(x)$   $x \in \Omega_T$  $BetP_{m_i}(x)$
$t_1$           0.995            $t_1$           0.005               $t_1$           0.95                $t_1$           0.05
$t_2$           0.005            $t_2$           0.995               $t_2$           0.05                $t_2$           0.95



Table 4. The Joint Pignistic Probability Model for the Target Identification Problem

$x \in \Omega_T$   Un-normalized Probability   Normalized Probability
$t_1$            $\approx$ 1.723E-26          $\approx$ 0.0820
$t_2$            $\approx$ 1.930E-25          $\approx$ 0.9180
Sum              $\approx$ 2.102E-25          1



Next, we will model this problem using probabilities from pignistic trans- 
formations of the 30 belief functions. The probability functions are shown in 
Table 3. The results of combining the 30 probability functions using pointwise 
multiplication and normalizing the resulting probability function are presented 
in Table 4. 

Notice that the pignistic probability model of the target identification problem
is qualitatively different from the belief function model. As per the pignistic
probability model, the target is approximately 11 times more likely to be a foe
than a friend. In general, if $m_1$ and $m_2$ are two bpa’s on the same domain, then
$BetP_{m_1} \otimes BetP_{m_2} \neq BetP_{m_1 \oplus m_2}$.

Next, consider the probability model for the target identification problem 
obtained from the belief function model using the plausibility transformation. 
This model is shown in Table 5. If we combine the 30 plausibility probability 
functions using pointwise multiplication and normalize the resulting probability 
function, we obtain the results in Table 6. 

Notice that the conclusion is similar to the result obtained in the belief func- 
tion model. In the next section, we will show that this equivalence between the 
belief function model conclusion and plausibility probability function is always 
true. 
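
The three computations of this example (joint bpa, joint pignistic model, joint plausibility model) can be reproduced with the following Python sketch. The helper names and the encoding of the frame as strings are assumptions of the illustration; the sensor reliabilities and readings are those stated in the text.

```python
from functools import reduce

T1, T2 = frozenset({"t1"}), frozenset({"t2"})
THETA = T1 | T2
FRAME = ("t1", "t2")


def sensor_bpa(target, reliability):
    """A working sensor points to `target`; otherwise nothing is known."""
    return {target: reliability, THETA: 1.0 - reliability}


# Readings x1..x10, y11, x12, y13..y30 (sensors 1-11 high quality, 12-30 low quality).
bpas = ([sensor_bpa(T1, 0.99)] * 10 + [sensor_bpa(T2, 0.99)]
        + [sensor_bpa(T1, 0.90)] + [sensor_bpa(T2, 0.90)] * 18)


def dempster(m_a, m_b):
    raw = {}
    for x, mass_x in m_a.items():
        for y, mass_y in m_b.items():
            if x & y:
                raw[x & y] = raw.get(x & y, 0.0) + mass_x * mass_y
    k = sum(raw.values())
    return {c: v / k for c, v in raw.items()}


def normalize(p):
    k = sum(p.values())
    return {x: v / k for x, v in p.items()}


def bet_p(m):
    return {x: sum(v / len(a) for a, v in m.items() if x in a) for x in FRAME}


def pl_p(m):
    return normalize({x: sum(v for a, v in m.items() if x in a) for x in FRAME})


def bayes(p, q):
    return normalize({x: p[x] * q[x] for x in FRAME})


joint = reduce(dempster, bpas)                        # joint bpa (Table 2)
bet_joint = reduce(bayes, [bet_p(m) for m in bpas])   # joint pignistic model (Table 4)
plp_joint = reduce(bayes, [pl_p(m) for m in bpas])    # joint plausibility model (Table 6)
```

Normalizing after every pairwise combination, as done here, gives the same result as normalizing once at the end and avoids very small intermediate numbers.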
Table 5. Plausibility Probability Function Encoding of Sensor Readings

Sensor $S_i = x_i$                Sensor $S_{11} = y_{11}$             Sensor $S_{12} = x_{12}$             Sensor $S_i = y_i$
$i = 1, \ldots, 10$                                                                                       $i = 13, \ldots, 30$
$x \in \Omega_T$  $Pl\_P_{m_i}(x)$   $x \in \Omega_T$  $Pl\_P_{m_{11}}(x)$   $x \in \Omega_T$  $Pl\_P_{m_{12}}(x)$   $x \in \Omega_T$  $Pl\_P_{m_i}(x)$
$t_1$           0.9901            $t_1$           0.0099               $t_1$           0.9091               $t_1$           0.0909
$t_2$           0.0099            $t_2$           0.9901               $t_2$           0.0909               $t_2$           0.9091



Table 6. The Joint Plausibility Probability Model

$x \in \Omega_T$   Un-normalized Probability   Normalized Probability
$t_1$            $\approx$ 1.4656E-21         $\approx$ 0.9091
$t_2$            $\approx$ 1.4656E-22         $\approx$ 0.0909
Sum              $\approx$ 1.6121E-21         1



5 Justification and Properties of the Plausibility 
Transformation 

In all three examples described in the previous section, there is a discrepancy 
between the pignistic probability function(s) obtained before and after combin- 
ing all evidence. Smets [10] resolves this apparent discrepancy of the pignistic 
transformation by stating that beliefs are held at the credal level and one only 
descends to the probability space for decision making at the time a decision has 
to be made. However, we view decision-making as a dynamic activity. 

Probability theory and belief function theory are two uncertainty calculi with 
roughly the same expressive power [1]. One should get roughly the same results 
regardless of the calculus one is using to represent knowledge if the models built
using the calculi are equivalent. An appropriate transformation method can allow 
a model of an uncertain domain in one calculus to be translated into the other. 
Thus we can exploit the advantages of both calculi. 

The pignistic transformation is justified based on a so-called “rationality” 
requirement, which implies a mathematical requirement of linearity. Other jus- 
tifications for the pignistic transformation are given in [10,11]. Some intuitive 
justifications for the plausibility transformation are given in [2,3]. Here we will 
state four theorems that demonstrate that the plausibility transformation is
consistent with belief function semantics.

Theorem 1. Suppose $m_1, \ldots, m_k$ are $k$ bpa’s. Suppose $Pl_{m_1}, \ldots, Pl_{m_k}$
are the associated plausibility functions, and suppose $Pl\_P_{m_1}, \ldots, Pl\_P_{m_k}$
are the corresponding probability functions. If $m = m_1 \oplus \ldots \oplus m_k$ is
the joint bpa, $Pl_m$ is the associated plausibility function, and $Pl\_P_m$ is
the corresponding plausibility probability function, then
$Pl\_P_{m_1} \otimes \ldots \otimes Pl\_P_{m_k} = Pl\_P_m$.
Fig. 1. A pictorial depiction of Theorem 1



Theorem 1 is depicted pictorially in Fig. 1. Notice that from a computational
perspective, it is much faster to compute $Pl\_P_{m_1} \otimes \ldots \otimes Pl\_P_{m_k}$ than it is to
compute $Pl\_P_m$ (since the latter involves Dempster’s rule of combination and
the former involves Bayes rule).

Given bpa $m$, we don’t view $Pl\_P_m$ as an approximation of $m$. Instead,
we view $Pl\_P_m$ as an equivalent probability encoding of the information in $m$.
Thus if we have a belief function model consisting of $\{m_1, \ldots, m_k\}$, then we
view $\{Pl\_P_{m_1}, \ldots, Pl\_P_{m_k}\}$ as an equivalent probability model. Theorem 1 can
be viewed as a regularity condition for any transformation method. As
demonstrated in the Peter, Paul, and Mary, and the target identification problems, the
pignistic transformation does not satisfy this condition.

If a unique state $x$ exists in a bpa $m$ such that $\lim_{n \to \infty} m^n(\{x\}) = 1$ (where
$m^n = m \oplus \ldots \oplus m$, $n$ times), an equivalent probability function should have $x$
as its most probable state. This property is satisfied for the plausibility
transformation, as stated in the following theorem.

Theorem 2. Consider a bpa $m$ for $s$ (with corresponding plausibility
function $Pl_m$) such that $x \in \Omega_s$ is the most plausible state, i.e.,
$Pl_m(\{x\}) > Pl_m(\{y\})$ for all $y \in \Omega_s \setminus \{x\}$. If $Pl_{m^\infty}$ denotes the
plausibility function corresponding to $m^\infty$, then $Pl_{m^\infty}(\{x\}) = 1$, and
$Pl_{m^\infty}(\{y\}) = 0$ for all $y \in \Omega_s \setminus \{x\}$.

In Example 2 presented in Sect. 4, $m^\infty(\{h_2\}) \approx 1$, so the most plausible
hypothesis in $m$ is $h_2$, consistent with $Pl\_P_m$ and not $BetP_m$.

If a bpa function has a subset of most plausible states, all with equal plausi- 
bility, the following theorem applies. 

Theorem 3. Consider a bpa $m$ for $s$ (with corresponding plausibility
function $Pl_m$) such that $t \subseteq \Omega_s$ is a subset of most plausible states, i.e.,
$Pl_m(\{x\}) = Pl_m(\{y\})$ for all $x, y \in t$, and $Pl_m(\{x\}) > Pl_m(\{z\})$ for all
$x \in t$ and $z \in \Omega_s \setminus t$. Then there exists a partition $\{a_1, \ldots, a_k\}$ of $t$ such
that $m^\infty(a_i) = 1/k$ for $i = 1, \ldots, k$, i.e., $Pl_{m^\infty}(\{x\}) = Pl_{m^\infty}(\{y\}) = 1/k$ for
all $x, y \in t$, and $Pl_{m^\infty}(\{z\}) = 0$ for all $z \in \Omega_s \setminus t$.
In the Peter, Paul, and Mary saga described earlier, the initial belief function
$m_1$ has a corresponding plausibility function where each state has equal
plausibility. Theorem 3 applies with $a_1 = \{Mary\}$, $a_2 = \{Peter, Paul\}$, and $k = 2$.
The next theorem states that $Pl\_P_m$ is idempotent if $m$ is idempotent.

Theorem 4. If $m$ is idempotent with respect to Dempster’s rule, i.e.,
$m \oplus m = m$, then $Pl\_P_m$ is idempotent with respect to Bayes rule, i.e.,
$Pl\_P_m \otimes Pl\_P_m = Pl\_P_m$.

As demonstrated in the Peter, Paul, and Mary example, $BetP_m$ does not
satisfy this property.

6 Conclusions and Summary 

In summary, if $T$ transforms a bpa $m$ in a belief function model to an equivalent
probability function $T(m)$, $T$ should satisfy four basic properties:

1) Invariance with respect to combination: $T(m_1 \oplus \ldots \oplus m_n) = T(m_1) \otimes
\ldots \otimes T(m_n)$, which is satisfied for the plausibility transformation, according to
Theorem 1;

2) Unique most plausible state: $\lim_{n \to \infty} T^n(m)(h_i) = 1$ if $\lim_{n \to \infty} m^n(\{h_i\}) = 1$,
which is satisfied for the plausibility transformation according to Theorem 2;

3) Non-unique most plausible states: If $\lim_{n \to \infty} Pl_{m^n}(x) = \lim_{n \to \infty} Pl_{m^n}(y)$
for all $x, y \in t \subseteq \Omega_s$ and $\lim_{n \to \infty} Pl_{m^n}(z) = 0$ for all
$z \in \Omega_s \setminus t$, then $\lim_{n \to \infty} T^n(m)(x) = \lim_{n \to \infty} T^n(m)(y)$ for all $x, y \in t$, and
$\lim_{n \to \infty} T^n(m)(z) = 0$ for all $z \in \Omega_s \setminus t$; this property is satisfied for the
plausibility transformation according to Theorem 3; and

4) Idempotency: $T(m)$ is idempotent if $m$ is idempotent, which is satisfied by
the plausibility probability transformation according to Theorem 4.

The main goal of this paper is to compare the pignistic and plausibility 
transformation methods for transforming belief function models to probability 
models. Until now, most of the literature on belief functions has used the pig- 
nistic method. The pignistic transformation method does not satisfy the invari- 
ance with respect to combination, most plausible, and idempotency axioms. On 
the other hand, the plausibility transformation satisfies all intuitively accept- 
able axioms we have postulated for an acceptable transformation method. We 
conjecture that the plausibility transformation method is the only method that 
satisfies these axioms, but we don’t have a proof of this claim. 
Abstract. In this paper a general framework consisting of fuzzy
database matching and evidential reasoning is presented. Data is
matched onto a database in a fuzzy, i.e. quantified, way. Pieces of
evidence are constructed from the matches. These update belief measures connected
to the elements of the database, using a simple support belief function.
A sorting and grouping of the database elements, and thresholding the
beliefs, makes the process stepwise. A qualitative, unambiguous decision
support is obtained at every step. The threshold and the maximum belief
for a piece of evidence are the parameters varied. Some properties of the
framework are exemplified in a case study of identifying air targets. For
a given ratio of the two parameters, the identification performance shows
a surprising non-monotonicity with respect to the threshold.



1 Introduction 

When matching data onto a database a certain “softness” can be useful. For ex- 
ample, if data in the database is approximative, a soft matching process might 
prevent a mismatch. On the other hand, a critical decision needs easily inter- 
preted decision support. A quantitative measure which is truly meaningful and 
unambiguous is hard to provide. Here we investigate some properties of a system 
where the quantitative measures are thresholded, providing a stepwise process 
with qualitative output. 

In [1] an evidential reasoning identification scheme based on partial
probability models is shown. Its advantage is summarised as follows: it is better to be only
partially but correctly informed than to risk being completely but incorrectly
informed. Here the same guiding star is followed, however with no likelihood
information at all. It is believed that even partial probabilities are hard to find,
especially in military applications such as the one in the case study below.

2 The Theory of Evidential Reasoning 

For a thorough description of the theory of belief functions, see e.g. [12]-[13]. [10],
[11] and [15] deal with both belief functions and Dempster’s rule of combination.
Below, a somewhat specialised description of the concepts and theories follows.
2.1 Belief and Belief Functions

A property called belief, or belief mass, is assigned to different hypotheses. The
total belief that has been assigned directly to a hypothesis $H_i$ is here denoted
$Bel(H_i)$, and describes to what degree hypothesis $H_i$ is supported by the evidence
gathered so far. It takes values in $[0, 1]$. The total belief mass spread out
among a set of hypotheses sums to unity. Thus, belief resembles probability.
Some important distinctions between probability and belief should however be
made:

1. belief is not a statistical property, i.e. has no interpretation as the frequency 
of a certain outcome in a random process, and 

2. implications such as

$A = B \cup C \Rightarrow Bel(A) \le Bel(B) + Bel(C)$ (1)

$Bel(A) = c \Rightarrow Bel(\neg A) = 1 - c$ (2)



need not hold. 

(1)-(2) above indicate that evidential reasoning is in a way more general than
probability theory. Whether this is so, in an actual application, is determined
by the choice of belief function (bf). The bf determines how gathered evidence
is used to update the beliefs. A hypothesis to which a bf assigns mass is called
a focal element of the bf.

A Bayesian bf distributes the belief of a piece of evidence among a set of exhaustive,
disjoint hypotheses. For a Bayesian bf (1)-(2) are true.

In the study of this paper a simple support bf will be used. Such a bf assigns
belief to

1. one of the hypotheses, and to 

2. the set of all hypotheses - the frame of discernment. 

The belief in the frame of discernment (fd), denoted $Bel(\theta)$, is belief not
distributed among the true hypotheses. Instead, it is assigned to the hypotheses
as a whole. This is useful when one does not know how to distribute the belief
mass. Therefore, it can be seen as a measure of the ignorance in the system.

The property $Bel(\neg A)$, see above, can be as interesting as $Bel(A)$ in an
application. It is represented through the plausibility for $A$, which is denoted
$Pl(A)$ and given by $1 - Bel(\neg A)$.



2.2 Dempster’s Rule of Combination 

Dempster’s rule of combination is a stringent way to combine quantified
evidence. Suppose a belief function has assigned belief mass to a set of focal
elements indexed with $i$, according to some evidence. Also, it assigns belief to a set
of focal elements indexed with $j$, according to some other piece of evidence. The
belief masses are denoted $\{m(H_i)\}$ and $\{m(H_j)\}$, respectively. Then, the belief
mass implicitly assigned directly to some third hypothesis $H_k$ is given by

$m(H_k) = \frac{1}{1 - K} \sum_{H_i \cap H_j = H_k} m(H_i) m(H_j)$ . (3)

$K$ is the conflict, given by

$K = \sum_{H_i \cap H_j = \emptyset} m(H_i) m(H_j)$ . (4)

The conflict is thus a measure of how much the two pieces of evidence contradict
each other. Note in (3) that $m(H_k)$ increases with $K$. This can be interpreted
as an elimination; if there is a large conflict between $\{H_i\}$ and $\{H_j\}$ many
possibilities are eliminated and the belief is concentrated to the intersection(s)
of $\{H_i\}$ and $\{H_j\}$.

Dempster’s rule is especially applicable in situations where evidence and/or
hypotheses are hierarchically structured, or otherwise non-disjoint. Also, “negative
evidence”, i.e. $m(\neg H_k)$, can take part in the combination. These properties
make evidential reasoning generally applicable. They can however also be a
drawback, since the intersection between two hypotheses (see (3)) might form
a new, unwanted hypothesis, or even a nonsense hypothesis. Such problems are
discussed and addressed in [3] and [11].

In this study Dempster’s rule will be used to write down update rules for
the beliefs. With the simple support bf only two update rules, or mappings, are
needed. If $H_i$ is a focal element, $Bel(H_i)$ undergoes

$Bel(H_i) \mapsto \frac{m(H_i) Bel(H_i)}{1 - K}$ , (5)

otherwise it follows

$Bel(H_i) \mapsto \frac{m(\theta) Bel(H_i)}{1 - K}$ . (6)

$Bel(\theta)$ can always be computed using the fact that the total belief must sum to
unity:

$Bel(\theta) = 1 - \sum_i Bel(H_i)$ , (7)

but an update rule can be formulated as well:

$Bel(\theta) \mapsto \frac{m(\theta) Bel(\theta)}{1 - K}$ . (8)
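
As an illustration of such an update, the sketch below combines the current belief masses with one piece of simple support evidence by applying Dempster's rule (3)-(4) directly. It assumes mutually exclusive hypotheses within one fd; the function name `update` and the numbers are hypothetical, and the code is derived from the rule of combination itself, not transcribed from the update rules as printed.

```python
def update(bel, bel_theta, focal, m_focal):
    """Combine current masses with evidence: m_focal on `focal`, the rest on the fd."""
    m_theta = 1.0 - m_focal
    # Conflict K: evidence for `focal` meets mass already placed on another hypothesis.
    conflict = m_focal * sum(b for h, b in bel.items() if h != focal)
    norm = 1.0 - conflict
    # Non-focal hypotheses keep only the part that meets the frame mass of the evidence.
    new_bel = {h: m_theta * b / norm for h, b in bel.items()}
    # Focal hypothesis: m_focal*Bel + m_focal*Bel(theta) + m_theta*Bel, simplified.
    new_bel[focal] = (bel[focal] + m_focal * bel_theta) / norm
    return new_bel, m_theta * bel_theta / norm


# Hypothetical run with two hypotheses and total ignorance at the start.
bel, bel_theta = {"A": 0.0, "B": 0.0}, 1.0
bel, bel_theta = update(bel, bel_theta, "A", 0.6)
bel, bel_theta = update(bel, bel_theta, "B", 0.5)
```

After both updates the masses on A, B and the fd still sum to one, in line with (7).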



3 Database Matching with the Simple Support Belief 
Function 

The hypotheses, characterised by some data, are stored as database elements.
Incoming data is then compared to, or matched against, the stored data for each
hypothesis. Each matching results in a piece of evidence. The simple
support bf assigns the mass to the hypothesis, and to the fd. If the incoming data
matches the stored data for a hypothesis, the amount of belief mass assigned to
that hypothesis is $m_{max}$. If not, the belief mass is smaller, but always non-zero.
Figure 1 shows (roughly) this fuzzy matching. In the figure the stored data is an
interval and the incoming data is a single value (along the horizontal axis).

Fig. 1. The matching function: On the horizontal axis is the incoming data sample;
a value of some property. On the vertical axis is the belief mass, $m$, of the resulting
piece of evidence. Values in $I$ match the stored data and yield maximum belief, $m_{max}$.
Values outside $I$ yield belief masses that are damped towards zero according to their
distance to $I$
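
A matching function of this kind might be sketched as follows. The text does not specify the shape of the damping, so the exponential decay, the `scale` parameter, and the function name `match_mass` are assumptions made purely for illustration.

```python
import math


def match_mass(value, lo, hi, m_max, scale):
    """Belief mass for one sample matched against the stored interval [lo, hi]."""
    if lo <= value <= hi:
        return m_max  # the sample matches the stored data
    distance = lo - value if value < lo else value - hi
    # Assumed damping: exponential decay with the distance to the interval.
    return m_max * math.exp(-distance / scale)


inside = match_mass(5.0, 0.0, 10.0, m_max=0.8, scale=2.0)
near = match_mass(11.0, 0.0, 10.0, m_max=0.8, scale=2.0)
far = match_mass(14.0, 0.0, 10.0, m_max=0.8, scale=2.0)
```

With this choice the mass stays positive for any finite distance in exact arithmetic, although for very large distances the floating-point result underflows to zero.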

Constructing pieces of evidence based on Euclidean distances has earlier been
proposed by Denœux [4]-[5]. The methods of Denœux do however require training
data. Here, the data available is just the uncertain estimates of the data
characterising each hypothesis.

In order to avoid matching the incoming data against all database elements, 
the elements are grouped into a tree structure, so that the matching is made 
hierarchically - coarse-to-fine. The grouping of the elements requires that these
can be sorted. If data describing the elements is multidimensional, one sorting
for every dimension is generally needed. 

In the tree of hypotheses each set of sibling nodes makes up its own fd. Note
this difference from [3] and [11], where evidence and/or hypotheses are naturally
hierarchical. Matching starts at the children of the root. When the belief for one
of those nodes becomes greater than some threshold value, $thr$, the matching
process proceeds down the subtree of that node. This process is repeated until
the belief in a leaf node reaches the threshold value. A data sample that leads
to reaching a threshold is immediately used again, in the new fd.

$m_{max}$ and $thr$ are parameters in the study of this system. Furthermore, $m_{max}$
is kept $< 1$. Thus, $0 < 1 - m_{max} \le m(\theta) < 1$.
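
The stepwise, thresholded descent through the tree can be sketched as below. This is a minimal illustration under several assumptions: each sample yields one piece of simple support evidence per sibling, the masses are combined with Dempster's rule for mutually exclusive siblings, and the names `Node`, `identify` and `mass_for` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def update(bel, theta, focal, m):
    """Dempster combination with simple support evidence (m on focal, 1 - m on the fd)."""
    norm = 1.0 - m * sum(b for h, b in bel.items() if h != focal)
    new_bel = {h: (1.0 - m) * b / norm for h, b in bel.items()}
    new_bel[focal] = (bel[focal] + m * theta) / norm
    return new_bel, (1.0 - m) * theta / norm


def identify(root, samples, mass_for, thr):
    """Descend one level whenever a sibling's belief exceeds thr; return the leaf reached."""
    node = root
    bel, theta = {c.name: 0.0 for c in node.children}, 1.0
    for sample in samples:
        reuse = True
        while reuse and node.children:
            reuse = False
            for child in node.children:
                bel, theta = update(bel, theta, child.name, mass_for(child, sample))
            winner = max(node.children, key=lambda c: bel[c.name])
            if bel[winner.name] > thr:
                node = winner  # new fd: the children of the winning node
                bel, theta = {c.name: 0.0 for c in node.children}, 1.0
                reuse = True   # the same sample is used again in the new fd
        if not node.children:
            return node.name
    return None  # no leaf reached the threshold with the given samples


# Hypothetical tree and matching masses, for illustration only.
root = Node("root", [Node("L", [Node("a"), Node("b")]), Node("R")])


def mass_for(node, sample):
    return 0.6 if node.name in sample else 0.05


result = identify(root, [{"L", "a"}] * 3, mass_for, thr=0.7)
too_few = identify(root, [{"L", "a"}], mass_for, thr=0.7)
```

In this toy run three samples are needed to reach the leaf `a`; with a single sample no threshold is reached and no decision is returned.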
4 Case Study: Air Target Identification

4.1 Flight Envelopes 

As an air target is being tracked, its position and velocity are normally estimated.
From this an estimate of the acceleration can be calculated. If also the orientation 
of the aircraft is being tracked,^ the g-load can be estimated. Synthetic data for 
height, speed and g-load, together forming a composite attribute, will be used 
as input in this study. 

Limits in the combinations of in-flight kinematical attribute values, for a 
certain aircraft type, are generally referred to as its flight envelope. Height, speed 
and g-load are here chosen as flight envelope attributes because 

- they should be reasonably feasible to track/estimate 

- information on limitations in these attributes is readily available for many 
air targets 

- limitations in these attributes (separately or combined) make up large parts 
of the actual flight envelope 

The actual flight envelope should involve more or less complicated dependencies 
between the three attributes, as well as others. However, for simplicity, the 
limitations are here chosen to be independent. Thus, our flight envelopes are boxes 
in height-speed-g-space, see Fig. 2. 




Fig. 2. The flight envelopes are chosen to be the trivial combination of limitations in 
height (h), speed (s) and g-load (g), respectively 

^ a numerical tracking method, such as particle filtering [6], is believed to be needed 

4.2 The Database 

The flight envelope is here considered a single but multidimensional attribute. 
Therefore, only one sorting is performed. The way to sort the flight envelopes 
accordingly is not straightforward. Here, a Bounding Volume Hierarchy 
is created: 

1. find the minimal volume enclosing all multidimensional objects (flight 
envelopes) 

2. divide the objects into two groups by splitting the minimal volume along its 
longest axis, into two volumes 

3. for each volume, go back to 2. and repeat the process until no volume contains 
more than one object 

A few additional conditions are needed: 

- If an object is cut into two pieces in the split, it is assigned to the volume in 
which its largest part resides. This volume must then be enlarged so that it 
encloses the object. 

- In 2. we interpret “longest” as “longest relative”, so that the extension of 
the objects along the particular axis matters. Since in this case we deal 
with boxlike objects, we compute the mean box length along every axis. The 
length of the bounding volume (see 1.) is then divided by this mean, yielding 
the number for comparison in 2. 
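A minimal sketch of this construction for box-shaped objects is given below. Several details are assumptions that the text leaves open: the split is made at the midpoint of the chosen axis, ties go to the lower volume, and if the split leaves one side empty the objects are divided by their centre along that axis instead. The names `Node`, `bounding_box`, `build_bvh` and `leaves` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[Tuple[float, float], ...]  # one (low, high) interval per axis


@dataclass
class Node:
    bounds: Box
    names: List[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bounding_box(boxes: List[Box]) -> Box:
    """Step 1: minimal axis-aligned volume enclosing all boxes."""
    dims = len(boxes[0])
    return tuple((min(b[d][0] for b in boxes), max(b[d][1] for b in boxes))
                 for d in range(dims))


def build_bvh(items: List[Tuple[str, Box]]) -> Node:
    boxes = [box for _, box in items]
    bounds = bounding_box(boxes)
    node = Node(bounds, [name for name, _ in items])
    if len(items) == 1:
        return node

    def relative_length(d: int) -> float:
        # Bounding length divided by the mean box length along axis d.
        mean_len = sum(b[d][1] - b[d][0] for b in boxes) / len(boxes)
        return (bounds[d][1] - bounds[d][0]) / mean_len if mean_len > 0 else 0.0

    axis = max(range(len(bounds)), key=relative_length)
    cut = 0.5 * (bounds[axis][0] + bounds[axis][1])  # assumed: midpoint split
    lower, upper = [], []
    for name, box in items:
        lo, hi = box[axis]
        below = max(0.0, min(hi, cut) - lo)   # part of the box below the cut
        above = max(0.0, hi - max(lo, cut))   # part of the box above the cut
        (lower if below >= above else upper).append((name, box))
    if not lower or not upper:
        # Assumed fallback: divide by box centre so the recursion terminates.
        ordered = sorted(items, key=lambda it: it[1][axis][0] + it[1][axis][1])
        half = len(ordered) // 2
        lower, upper = ordered[:half], ordered[half:]
    # Child bounds are recomputed from the assigned boxes, which enlarges a
    # volume whenever a cut object was assigned to it.
    node.left, node.right = build_bvh(lower), build_bvh(upper)
    return node


def leaves(node: Node) -> List[Node]:
    if node.left is None:
        return [node]
    return leaves(node.left) + leaves(node.right)
```

Since all envelopes in this study start at zero height and zero speed, the midpoint rule alone often fails to separate them, so the fallback matters in practice; a different split position may well give a more balanced tree.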

The database for this study was taken to consist of the target hypotheses 
shown in Table 1.^ 



Table 1. The target types and their flight envelopes 

Type             heights (m)    speeds (km/h)    g-loads 
Apache           0 to 4570      0 to 365         -0.5 to 3.5 
F 16             0 to 15240     0 to 2124        -5* to 9 
Hawk 200         0 to 13715     0 to 1065        -4 to 8 
Jaguar S         0 to 14000     0 to 1699        -4.6* to 8.6 
Mig 23 B         0 to 16800     0 to 1900        -4* to 7 
Mig 29           0 to 17000     0 to 2445        -5* to 9 
Mirage 2000 C    0 to 18000     0 to 2338        -5* to 9 
Su-27            0 to 17700     0 to 2280        -4* to 8 
Tornado IDS      0 to 15240     0 to 2338        -3.9* to 7.5 

Most values were taken from [7]. Others, marked with *, have been calculated 
using the approximate equation $g_{min} = -0.6\,(g_{max} - 1)$. This was taken from 
[8]. The BVH of the flight envelopes for the target type hypotheses is seen in 
Fig. 3. 
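As a quick check, the starred entries of Table 1 can be recomputed, assuming the approximation reads $g_{min} = -0.6\,(g_{max} - 1)$. The function name `approx_g_min` is hypothetical; the table values appear to be rounded, in some rows to whole numbers.

```python
def approx_g_min(g_max: float) -> float:
    """Approximate minimum g-load from the maximum g-load (assumed formula)."""
    return -0.6 * (g_max - 1.0)


for name, g_max in [("F 16", 9.0), ("Jaguar S", 8.6), ("Mig 23 B", 7.0),
                    ("Su-27", 8.0), ("Tornado IDS", 7.5)]:
    print(f"{name:12s} g_max = {g_max:4.1f}  g_min = {approx_g_min(g_max):5.2f}")
```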

^ The target types were chosen on an entirely non-political basis. 

Fig. 3. The Bounding Volume Hierarchy of the aircraft flight envelopes used in this 
study 



4.3 More on Matching with a Simple Support BF 

In our case every fd is made up of two disjoint hypotheses, $H_1$ and $H_2$. Data is 
matched against both hypotheses in a fd, i.e. the two hypotheses take turns being 
the focal element. Each of the two beliefs thus undergoes both (5) and (6) for each 
incoming data sample. Accordingly, $Bel(\Theta)$ undergoes (8) twice. Combining (5) 
and (6) into one mapping, regardless of order(!), gives: 

$$Bel(H_1) \mapsto \frac{(1 - m(H_2))\,\bigl(Bel(H_1) + m(H_1)\,Bel(\Theta)\bigr)}{1 - m(H_1)\,Bel(H_2) - m(H_2)\,Bel(H_1) - m(H_1)\,m(H_2)\,Bel(\Theta)} \qquad (9)$$

The mapping for $H_2$ is obtained if '1' and '2' are interchanged in (9). 

The only valid fixed points [14] for $(Bel(H_1), Bel(H_2))$ are 
$(Bel(H_1), Bel(H_2)) = (1, 0)$ and $(Bel(H_1), Bel(H_2)) = (0, 1)$. However, for the special 
case $m(H_1) = m(H_2)$ every pair of values $(Bel(H_1), 1 - Bel(H_1))$ is a fixed 
point. Either way, $Bel(\Theta)$ must approach zero. This is understood from (8). 
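The combined mapping can be checked numerically. The sketch below assumes that the two single-hypothesis updates are Dempster combinations with a simple support function (mass on one hypothesis, the remainder on the whole frame) and that $Bel(\Theta)$ denotes the mass left on the frame. The function names are hypothetical.

```python
def combine_simple_support(bel, focal, mass):
    """Dempster-combine bel = (b1, b2, b_theta) with a simple support
    function putting `mass` on hypothesis `focal` (0 or 1)."""
    b = [bel[0], bel[1]]
    b_theta = bel[2]
    other = 1 - focal
    norm = 1.0 - mass * b[other]          # 1 minus the conflicting mass
    new = [0.0, 0.0]
    new[focal] = (b[focal] + mass * b_theta) / norm
    new[other] = b[other] * (1.0 - mass) / norm
    return (new[0], new[1], b_theta * (1.0 - mass) / norm)


def combined_mapping(bel, m1, m2):
    """New belief in H1 after matching against both H1 and H2."""
    b1, b2, bt = bel
    num = (1.0 - m2) * (b1 + m1 * bt)
    den = 1.0 - m1 * b2 - m2 * b1 - m1 * m2 * bt
    return num / den


start = (0.2, 0.3, 0.5)
m1, m2 = 0.6, 0.4
h1_then_h2 = combine_simple_support(combine_simple_support(start, 0, m1), 1, m2)
h2_then_h1 = combine_simple_support(combine_simple_support(start, 1, m2), 0, m1)
print(h1_then_h2[0], h2_then_h1[0], combined_mapping(start, m1, m2))

# With equal masses and no mass left on the frame, every pair (b, 1 - b)
# is mapped onto itself.
print(combined_mapping((0.3, 0.7, 0.0), 0.5, 0.5))
```

Both orders of combination give the same belief in $H_1$ as the closed-form mapping, which illustrates the remark that the order does not matter.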

As a belief threshold thr is introduced, its value, together with the matching 
function (Fig. 1) and of course the incoming data, will determine whether the threshold 
is reached. $m_{max}$ and thr are the parameters varied. The curved parts of the 
matching function are chosen to have exponential shape. The exponent is a 
“relative distance to a match”, with an overall minus sign. For example, if a height 
$h$ is found to exceed the maximum height $h_{max}$ for a certain flight envelope, 
a penalty factor of $\exp\left(-\frac{h - h_{max}}{h_{max} - h_{min}}\right)$ multiplies $m_{max}$. Each dimension of the 
database elements possibly contributes its own penalty factor, independent 
of the others. 
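A sketch of such a matching function for box-shaped envelopes follows. It assumes the penalty exponent is the overshoot divided by the envelope width along that axis, and that values below the minimum are penalised symmetrically; the latter is not stated in the text. The name `matching_mass` is hypothetical.

```python
import math


def matching_mass(sample, envelope, m_max):
    """Mass assigned to a hypothesis for one data sample.

    sample:   one value per dimension, e.g. (height, speed, g-load)
    envelope: one (low, high) interval per dimension
    """
    mass = m_max
    for value, (low, high) in zip(sample, envelope):
        width = high - low
        if value > high:
            mass *= math.exp(-(value - high) / width)
        elif value < low:  # assumed symmetric treatment
            mass *= math.exp(-(low - value) / width)
    return mass


mirage_2000c = ((0, 18000), (0, 2338), (-5, 9))
mig_29 = ((0, 17000), (0, 2445), (-5, 9))
sample = (17500, 2000, 1)  # height (m), speed (km/h), g-load

print(matching_mass(sample, mirage_2000c, m_max=0.5))  # inside the envelope
print(matching_mass(sample, mig_29, m_max=0.5))        # 500 m above the ceiling
```

A sample inside the envelope receives the full $m_{max}$, while each violated dimension shrinks the mass by its own factor.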

4.4 Results 

The scenario consists of an agile aircraft coming in at high altitude and high 
speed. Maintaining speed and altitude it performs a heavy level turn, then leaves. 
The data in Table 2 is an attempt to mimic this. 



Table 2. The scenario data 

Time    height (m)    speed (km/h)    g-load 
1       17500         2000            1 
2       17500         2000            1 
3       17500         2000            1 
4       17500         2000            1 
5       17500         2000            3 
6       17500         2000            6 
7       17500         2000            9 
8       17500         2000            9 
9       17500         2000            6 
10      17500         2000            3 
11      17500         2000            1 
12      17500         2000            1 
13      17500         2000            1 
14      17500         2000            1 



Figures 4-6 show the identification progress when thr = $m_{max}$ = 
0.5, 0.7 and 0.9, respectively. (Thresholds below 0.5 are not allowed, since they could 
result in the identification process proceeding down both branches of a node in 
the BVH.) The plots show the highest belief in the current fd at time t, after 
including the (two pieces of) evidence gathered at that time. Note that passing 
a threshold (dotted line) means a partial or full identification has been made. 
All dotted lines except the uppermost thus separate different fd’s, and therefore 
represent $Bel(\cdot)$ = thr in one fd and $Bel(\cdot)$ = 0 in the other (see figures). True 
identification has obviously been reached only for thr = $m_{max}$ = 0.5, see Fig. 4. 

The other two cases, Figs. 5 and 6, show similar results, where the belief seems 
to have converged two levels beneath true identification. 

In order to reach identification also for the higher thresholds, $m_{max}$ was 
increased to $1.1 \cdot thr$, see Figs. 7-9. The identification performance then changes 
dramatically for thr = 0.9, Fig. 9. This configuration is now the fastest to 
complete the identification. 

Surprisingly, the performance is lowest, and has seemingly not changed at all, 
for the intermediate threshold 0.7, see Fig. 8. Intuitively, there seems to be no 
reason for such non-monotonicity in identification performance. This is further 
discussed in the concluding words, Sect. 5. 

Increasing $m_{max}$ further eventually led to full identification also for the 
intermediate threshold. 

Fig. 4. The highest belief, in the current frame of discernment, after including evidence 
at time t. Here, with thr = $m_{max}$ = 0.5, the Mirage 2000C is identified. Note that each 
dotted line (except the uppermost) separates two fd’s and thus represents two belief 
values 

Fig. 5. The highest belief, in the current frame of discernment, after including evidence 
at time t. Here, with thr = $m_{max}$ = 0.7, the belief converges below the {F16, Mig29, 
Mirage 2000C}-threshold. Note that each dotted line (except the uppermost) separates 
two fd’s and thus represents two belief values 

Fig. 6. The highest belief, in the current frame of discernment, after including evidence 
at time t. Here, with thr = $m_{max}$ = 0.9, the belief converges below the {F16, Mig29, 
Mirage 2000C}-threshold. Note that each dotted line (except the uppermost) separates 
two fd’s and thus represents two belief values 

Fig. 7. The highest belief, in the current frame of discernment, after including evidence 
at time t. Here, with thr = 0.5 and $m_{max}$ = 0.55, the Mirage 2000C is identified. Note 
that each dotted line (except the uppermost) separates two fd’s and thus represents 
two belief values 

Fig. 8. The highest belief, in the current frame of discernment, after including evidence 
at time t. Here, with thr = 0.7 and $m_{max}$ = 0.77, the belief converges below the {F16, 
Mig29, Mirage 2000C}-threshold. Note that each dotted line (except the uppermost) 
separates two fd’s and thus represents two belief values 

Fig. 9. The highest belief, in the current frame of discernment, after including evidence 
at time t. Here, with thr = 0.9 and $m_{max}$ = 0.99, the Mirage 2000C is identified. Note 
that each dotted line (except the uppermost) separates two fd’s and thus represents 
two belief values 

5 Conclusions 

Using a fuzzy matching of data onto a database means that a balance between 
softness and sensitivity to data must be struck. Translating the softness into an 
evidential reasoning framework does not resolve this. However, the framework 
presented provides two explicit parameters to investigate and control the 
sensitivity of the system. In an operational system, $m_{max}$ could for example be a runtime 
variable. This way the sensitivity of the identification could be related to the 
current situation. 

The case study lacks generality in several ways. Therefore, it should be 
considered mostly a way to exemplify some of the properties of the proposed 
framework. The results are summarised below. 

For the parameter values tried, it is indicated that a threshold close to 1 
provides the largest range of sensitivity. The corresponding smallest range seems 
to occur for a threshold in the interior of [0.5, 1] rather than at the boundary 
thr = 0.5. This is found surprising. Perhaps the parameter ratio $m_{max}/thr$, 
which was kept constant when comparing identification performance for different 
thresholds, is not the relevant property. Instead, e.g. the square root of that 
property might bring back the monotonicity. To investigate this, the mappings 
(9) have to be extended with the modulus function, and analysed. As this is 
complicated by the nonlinearity of the modulus function, further simulations 
might also be needed. All this must be the subject of future studies. 

As an alternative to the design above, the identification progress could be 
“stepped back” at some criterion, e.g. when $Bel(\Theta)$ becomes too small. Allowing 
revocable identifications is however considered dangerous when the result is the 
basis for irrevocable decisions, which is likely in the application above. A less 
sensitive, more “soft” identification should then be preferred. 

Abstract. In Dempster-Shafer theory, belief degrees on any subset 
$A \subseteq \Omega$ of states of nature are computed through a basic belief mass 
$m(A)$, which quantifies the strength of the statement: “the agent has some 
reason to believe that the true world is in A”. We may be interested in 
representing some negative belief, for example: “the agent has some 
reason not to believe that the true world is in A”. However, as remarked 
by Smets, this is not allowed in the theory of Dempster-Shafer. Attempts 
to model this situation have been proposed by Smets, and by Dubois, 
Prade and Smets in the framework of possibility theory. These solutions, 
however, do not seem able to handle all the facets of the problem, 
and have the drawback of producing an interval or a pair of values. 
In this paper, we propose an alternative solution that consists in assigning 
a single number to a pair of events. 



1 Introduction 

In Dempster-Shafer theory [7], belief degrees on any subset $A \subseteq \Omega$ of states 
of nature are computed through a basic mass (of belief) allocation function 
$m : 2^\Omega \to [0,1]$, which assigns values to specific subsets of $\Omega$, called focal 
elements. The mass function is such that $\sum_{A \subseteq \Omega} m(A) = 1$, and $m(A) = 0$ if $A$ 
is not a focal element. Then the belief function $Bel : 2^\Omega \to [0,1]$ is given by 
$Bel(A) := \sum_{B \subseteq A} m(B)$. 

The mass function m represents the basic knowledge, or belief, of some agent 
on the true (but unknown) state of nature (or world). The value m(A) quantifies 
the strength of the statement: “the agent has some reason to believe that the 
true world is in A”, based on some information concerning solely A. We may 
be interested in representing some negative belief, for example: “the agent 
has some reason not to believe that the true world is in A”. But, as remarked 
by Smets [8], this is not allowed in the theory of Dempster-Shafer. The reason 
is that there is no inverse mass function for the Dempster rule of combination. 
The following example from Smets [8] shows that in some situations we may need 
such a notion of negative belief. 

The Ukalvia Example. You are told that a newspaper reports that 
the economic situation in Ukalvia is good. You have never heard about 
Ukalvia before, so you may start to believe the information. Later, you 
discover that the newspaper is controlled by the unique authorized party 
in Ukalvia. So you think that it may be propaganda, and you may have 
now some reason not to believe that the economic situation is good. 

Attempts to model this situation have been given by Smets [8] by means of a 
pair of functions called confidence and diffidence, and by Dubois, Prade and 
Smets [4] in the framework of possibility theory, also using a pair of functions, 
which represent a guaranteed possibility and a (usual) possibility degree. These 
solutions, however, do not seem able to handle all the facets of the problem, 
and have the drawback of producing an interval or a pair of values. In this 
paper, we propose an alternative solution based on bi-capacities [6,5]. As will 
be seen, our solution is in some sense the converse of the previous ones: instead 
of assigning a pair of numbers to an event, we assign a single number to a pair 
of events. 

In the sequel, the finite universal set (states of nature) will be denoted $\Omega$, of 
cardinality n. 



2 Background 

Let $\mathcal{Q}(\Omega) := \{(A,B) \in 2^\Omega \times 2^\Omega \mid A \cap B = \emptyset\}$. It is easy to see that $\mathcal{Q}(\Omega)$ is a 
lattice, when equipped with the following order: $(A,B) \sqsubseteq (C,D)$ if $A \subseteq C$ and 
$B \supseteq D$. Supremum and infimum are respectively 

$$(A,B) \sqcup (C,D) = (A \cup C,\; B \cap D)$$

$$(A,B) \sqcap (C,D) = (A \cap C,\; B \cup D).$$

Top and bottom are respectively $(\Omega, \emptyset)$ and $(\emptyset, \Omega)$. We call vertices of $\mathcal{Q}(\Omega)$ any 
element $(A,B)$ such that $A \cup B = \Omega$, since they coincide with the vertices of 
$[0,1]^n$. We give in Figure 1 the Hasse diagram of $(\mathcal{Q}(\Omega), \sqsubseteq)$ for n = 3. 
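The structure is small enough to enumerate. The following sketch (function names are hypothetical) builds $\mathcal{Q}(\Omega)$ for a three-element $\Omega$ and implements the order, supremum and infimum exactly as defined above, so that the lattice properties can be checked by brute force.

```python
from itertools import product


def elements(omega):
    """All pairs (A, B) of disjoint subsets of omega, as frozensets."""
    omega = sorted(omega)
    result = []
    # Each state of nature is in neither set (0), in A (1) or in B (2).
    for states in product((0, 1, 2), repeat=len(omega)):
        a = frozenset(x for x, s in zip(omega, states) if s == 1)
        b = frozenset(x for x, s in zip(omega, states) if s == 2)
        result.append((a, b))
    return result


def leq(x, y):
    """(A, B) below (C, D): A is a subset of C and B is a superset of D."""
    return x[0] <= y[0] and x[1] >= y[1]


def sup(x, y):
    return (x[0] | y[0], x[1] & y[1])


def inf(x, y):
    return (x[0] & y[0], x[1] | y[1])


q = elements({1, 2, 3})
top = (frozenset({1, 2, 3}), frozenset())
bottom = (frozenset(), frozenset({1, 2, 3}))
print(len(q))  # 3**3 elements
print(all(leq(bottom, x) and leq(x, top) for x in q))
```

Encoding each state of nature by one of three positions also makes it plain why the lattice has $3^n$ elements.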

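In [2], Bilbao et al. introduced other operations on $\mathcal{Q}(\Omega)$, which are: 

$$(A,B) \sqsubseteq' (C,D) \text{ if } A \subseteq C \text{ and } B \subseteq D$$

$$(A,B) \sqcup' (C,D) := ((A \cup C) \setminus (B \cup D),\; (B \cup D) \setminus (A \cup C))$$

$$(A,B) \sqcap' (C,D) := (A \cap C,\; B \cap D).$$

However, $(\mathcal{Q}(\Omega), \sqcup', \sqcap')$ is not a lattice since e.g. for any $A \subseteq \Omega$, $(A, A^c) \sqcup' (A^c, A) = (\emptyset, \emptyset)$ 
but $(A, A^c) \not\sqsubseteq' (\emptyset, \emptyset)$ since $(A, A^c) \sqcup' (\emptyset, \emptyset) = (A, A^c) \neq (\emptyset, \emptyset)$. 
This justifies the choice of operations $\sqsubseteq$, $\sqcup$ and $\sqcap$ instead of $\sqsubseteq'$, $\sqcup'$ and $\sqcap'$. 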

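Following usual conventions, $\mathcal{Q}(\Omega)$ is the lattice called $3^n$. For any ordered 
pair $((A,B), (A \cup D, B \setminus C))$ of $\mathcal{Q}(\Omega)$ with $C \subseteq B$ and $D \subseteq (\Omega \setminus (A \cup B)) \cup C$, the 
interval $[(A,B), (A \cup D, B \setminus C)]$ is a sub-lattice of type $2^k \times 3^l$, with $k = |C \,\Delta\, D|$ 
and $l = |C \cap D|$. 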

We recall some definitions and a fundamental result of lattice theory [3]. 

Fig. 1. The lattice $\mathcal{Q}(\Omega)$ for n = 3 



Definition 1. Let $(L, \sqsubseteq)$ be a lattice. Its bottom is denoted by $\bot$. An element 
$x \in L$ is $\sqcup$-irreducible if $x \neq \bot$ and $x = a \sqcup b$ implies $x = a$ or $x = b$, $\forall a, b \in L$. 

In a finite lattice, x is $\sqcup$-irreducible if it has only one predecessor. It is easy to 
see that the $\sqcup$-irreducible elements of $\mathcal{Q}(\Omega)$ are $(\emptyset, i^c)$ and $(i, i^c)$, for all $i \in \Omega$. 
In Figure 1, $\sqcup$-irreducible elements are indicated by black circles. The main 
interest of $\sqcup$-irreducible elements lies in the fact that every element in $\mathcal{Q}(\Omega)$ can 
be written as a supremum over these elements [6,5]: 

$$(A,B) = \bigsqcup_{i \in A} (i, i^c) \;\sqcup \bigsqcup_{j \in B^c} (\emptyset, j^c). \qquad (1)$$

The $\sqcap$-irreducible elements are $(i^c, \emptyset)$ and $(i^c, i)$, for all $i \in \Omega$. Every element in 
$\mathcal{Q}(\Omega)$ can be written as an infimum over these elements [6,5]: 

$$(A,B) = \bigsqcap_{i \in A^c} (i^c, \emptyset) \;\sqcap \bigsqcap_{j \in B} (j^c, j). \qquad (2)$$

It may be more natural to introduce the dual structure of $\mathcal{Q}(\Omega)$ by replacing 
an element $(A,B) \in \mathcal{Q}(\Omega)$ by $(A, B^c)$. We thus obtain $\mathcal{Q}^*(\Omega) := \{(A,B) \mid A \subseteq B\}$. 

Bi-capacities in [6,5] were defined as follows. $v : \mathcal{Q}(\Omega) \to \mathbb{R}$ is a bi-capacity 
if $v(\emptyset, \emptyset) = 0$, and $A \subseteq B$ implies $v(A, \cdot) \le v(B, \cdot)$ and $v(\cdot, A) \ge v(\cdot, B)$ (isotonicity). 
If in addition $v(\Omega, \emptyset) = 1$ and $v(\emptyset, \Omega) = -1$, v is said to be normalized. 

In terms of ordered sets, v is an order-preserving mapping from $(\mathcal{Q}(\Omega), \sqsubseteq)$ 
to $([-1,1], \le)$, preserving also top and bottom, and with a fixed point $(\emptyset, \emptyset)$. 
This definition was motivated by multicriteria decision making on bipolar scales, 
hence the interval centered around 0, and the fixed point. We may take a slightly 
more general view, by dropping the fixed point restriction and allowing any 
totally ordered set with top and bottom instead of the interval $[-1, 1]$. Indeed, 
in the present case, we are dealing with uncertainty, where the usual scale is 
rather [0,1] (unipolar), and we have no reason to consider a central point 
such as 1/2. Hence, we adopt the following definition. 

Definition 2. A unipolar bi-capacity is a function $v : \mathcal{Q}(\Omega) \to [0,1]$ such that 
$v(\Omega, \emptyset) = 1$, $v(\emptyset, \Omega) = 0$, and v is isotone. 

In the sequel we drop the term “unipolar” as long as no confusion arises. 

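A fundamental notion for bi-capacities is the Möbius transform. It is shown 
in [6,5] that its expression is, for any function v on $\mathcal{Q}(\Omega)$: 

$$m(A,A') = \sum_{\substack{B \subseteq A,\; B' \supseteq A' \\ B' \cap A = \emptyset}} (-1)^{|A \setminus B| + |B' \setminus A'|}\, v(B,B')$$

and conversely, 

$$v(A,A') = \sum_{(B,B') \sqsubseteq (A,A')} m(B,B'). \qquad (3)$$

For unipolar bi-capacities, the normalization conditions read: 

$$\sum_{(A,B) \in \mathcal{Q}(\Omega)} m(A,B) = 1 \qquad (4)$$

$$m(\emptyset, \Omega) = 0 \qquad (5)$$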
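The pair of formulas can be exercised on a small example. The sketch below assumes the transform sums over $B \subseteq A$ and $A' \subseteq B' \subseteq A^c$ with sign $(-1)^{|A \setminus B| + |B' \setminus A'|}$, and checks that it inverts the summation in (3). The function names and the sample masses are illustrative only.

```python
from itertools import combinations


def subsets(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]


def zeta(m, omega):
    """v(A, A') as the sum of m(B, B') over B within A and B' containing A'."""
    omega = frozenset(omega)
    return {(a, a2): sum(val for (b, b2), val in m.items()
                         if b <= a and b2 >= a2)
            for a in subsets(omega) for a2 in subsets(omega - a)}


def mobius(v, omega):
    """Moebius transform of a function v given on all of Q(omega)."""
    omega = frozenset(omega)
    m = {}
    for (a, a2) in v:
        total = 0.0
        for b in subsets(a):
            # B' runs over A' plus any states lying in neither A nor A'.
            for extra in subsets(omega - a - a2):
                sign = (-1) ** (len(a - b) + len(extra))
                total += sign * v[(b, a2 | extra)]
        m[(a, a2)] = total
    return m


omega = frozenset("GB")
G, B, empty = frozenset("G"), frozenset("B"), frozenset()
mass = {(G, B): 0.6, (omega, empty): 0.1, (empty, G): 0.2, (empty, empty): 0.1}

v = zeta(mass, omega)
recovered = mobius(v, omega)
print(v[(omega, empty)], v[(empty, omega)])
print({k: round(x, 10) for k, x in recovered.items() if abs(x) > 1e-12})
```

The recovered masses agree with the ones the example started from, and the top and bottom values of v reflect the normalization conditions (4) and (5).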

By analogy with classical definitions, a bi-capacity is said to be k-monotone 
($k \ge 2$) if for all families of k elements $(A_1, B_1), \ldots, (A_k, B_k)$ in $\mathcal{Q}(\Omega)$, 

$$v\Bigl(\bigsqcup_{i=1}^{k} (A_i, B_i)\Bigr) \ge \sum_{\emptyset \neq K \subseteq \{1, \ldots, k\}} (-1)^{|K|+1}\, v\Bigl(\bigsqcap_{i \in K} (A_i, B_i)\Bigr). \qquad (6)$$

For k = 2, the definition reduces to 

$$v((A,B) \sqcup (C,D)) + v((A,B) \sqcap (C,D)) \ge v(A,B) + v(C,D)$$

and v is said to be supermodular or convex. A bi-capacity v is said to be 
k-alternating when v satisfies the reversed inequality with $\sqcup$ and $\sqcap$ inverted. 
2-alternating bi-capacities are said to be submodular. When v is k-monotone (resp. 
alternating) for any $k \ge 2$, then v is said to be totally monotone (resp. 
alternating). 

The conjugate of a (unipolar) bi-capacity is defined by 

$$\bar{v}(A,B) = 1 - v(B,A).$$



3 Bi-belief Functions 

3.1 Mathematical Definition 

Let v be a function on $\mathcal{Q}(\Omega)$ such that $v(\emptyset, \Omega) = 0$ and $v(\Omega, \emptyset) = 1$, and let m 
be its Möbius transform. By analogy with the classical case, we say that v is a 
bi-belief function if for any $(A,B) \in \mathcal{Q}(\Omega)$ we have $m(A,B) \ge 0$. Clearly, the 
non-negativity of m implies that v is isotone, hence v is a unipolar bi-capacity. 
We denote bi-belief functions by Bel. 

If v is a bi-belief function, then the conjugate bi-capacity $\bar{v}$ is called a 
bi-plausibility function, which we will denote by Pl. We have: 

$$Pl(A,B) = 1 - Bel(B,A), \quad \forall (A,B) \in \mathcal{Q}(\Omega). \qquad (7)$$

If we stick to the usual vocabulary of Shafer, elements of $\mathcal{Q}(\Omega)$ where the 
Möbius transform is strictly positive are called focal elements, and m could be 
called the mass allocation or simply mass. By (3), (4) and (5), we can express 
bi-belief and bi-plausibility functions in terms of mass: 

$$Bel(A,B) = \sum_{(C,D) \sqsubseteq (A,B)} m(C,D) = \sum_{\substack{C \subseteq A,\; D \supseteq B \\ C \cap D = \emptyset}} m(C,D) \qquad (8)$$

$$Pl(A,B) = 1 - \sum_{(C,D) \sqsubseteq (B,A)} m(C,D) = \sum_{(C,D) \not\sqsubseteq (B,A)} m(C,D)$$

$$= \sum_{\substack{C \cap B^c \neq \emptyset,\; D \supseteq A \\ C \cap D = \emptyset}} m(C,D) + \sum_{\substack{D \cup A^c \neq \Omega,\; C \subseteq B \\ C \cap D = \emptyset}} m(C,D) + \sum_{\substack{C \cap B^c \neq \emptyset,\; D \cup A^c \neq \Omega \\ C \cap D = \emptyset}} m(C,D). \qquad (9)$$

Note the analogy with classical formulas. The following property shows the 
similarity with the classical case. 

Proposition 1. A (unipolar) bi-capacity is totally monotone (resp. alternating) 
if and only if it is a bi-belief (resp. bi-plausibility) function. 

The proof comes from a general result by Barthélemy [1], proving that when 
the Möbius transform of some function f on a finite lattice is non-negative, f is 
monotone and totally monotone. The case of bi-plausibility is obtained dually. 

3.2 Interpretation 

After having given a mathematical definition which seems to keep, in a more 
general framework, all the usual properties of classical belief functions, we focus on 
interpretation and go back to our main motivation. 

Generally speaking, the quantity $Bel(A,B)$ represents the degree to which 
the agent believes that A contains the true state of nature (say $\omega_0$) and B does 
not contain it. The remaining part $(A \cup B)^c$ is the “ignorance” part. For classical 
belief functions, $Bel(A)$ is the degree to which the agent believes that A contains 
the true state $\omega_0$, or equivalently that $A^c$ does not contain $\omega_0$, hence 
corresponding to $Bel(A, A^c)$. We have here much more flexibility, and the degree of belief 
about whether A contains $\omega_0$ is expressed by all the quantities $\{Bel(A,B)\}_{B \subseteq A^c}$. 
Due to isotonicity, all these quantities are ordered along chains from $(A, \emptyset)$ to 
$(A, A^c)$: 

$$Bel(A, \emptyset) \ge Bel(A, \{i\}) \ge Bel(A, \{i,j\}) \ge \cdots \ge Bel(A, A^c)$$

for any $i, j, \ldots$ in $A^c$. Note that $(A, \emptyset)$ is the least “demanding” event since $\emptyset$ 
cannot contain the true state whatsoever, while $(A, A^c)$ is the most demanding 
event. In the classical view, all these quantities collapse to a single one (we 
will properly show this below). Conversely, let us consider for a given B all the 
quantities $\{Bel(A,B)\}_{A \subseteq B^c}$. They are also partially ordered along chains from 
$(B^c, B)$ to $(\emptyset, B)$: 

$$Bel(B^c, B) \ge Bel(B^c \setminus \{i\}, B) \ge Bel(B^c \setminus \{i,j\}, B) \ge \cdots \ge Bel(\emptyset, B)$$

for any $i, j, \ldots$ in $B^c$. The event $(\emptyset, B)$ is a situation where there is no evidence 
that $\omega_0$ could be in some subset of $\Omega$, but we have some evidence that $\omega_0$ is not 
in B. 

With this interpretation in mind, we speak of a purely positive or confidence 
belief for events of the form $(A, A^c)$, while purely negative or diffidence beliefs 
are events of the form $(\emptyset, B)$. 

What about mass allocation? Let us stick first to the view of Smets [8]. We 
should define a confidence mass $m^+$ and a diffidence mass $m^-$ with the meaning 
explained in the introduction: $m^+(A)$ quantifies the statement “the agent has 
some reason to believe that the true world is in A”, while $m^-(A)$ quantifies “the 
agent has some reason not to believe that the true world is in A”. Now, we can 
merge these into a single mass function on $\mathcal{Q}(\Omega)$. In accordance with 
the above interpretation we put $m(A, A^c) := m^+(A)$ and $m(\emptyset, A) := m^-(A)$. We have 
thus recovered the model sought by Smets, and we are even more general, since 
in principle we could define m on any element of $\mathcal{Q}(\Omega)$. 

Using the dual structure $\mathcal{Q}^*(\Omega)$, another interesting interpretation comes up, 
which will correspond to the view of Dubois et al. [4]. To switch from $\mathcal{Q}(\Omega)$ to 
$\mathcal{Q}^*(\Omega)$, it suffices for an element $(A,B)$ of $\mathcal{Q}(\Omega)$ to turn B into $B^c$. Then in 
$\mathcal{Q}^*(\Omega)$, an element $(A,B)$ is such that $A \subseteq B$. In the framework above inspired 
by Smets, such an event could be interpreted as: B (certainly) contains the 
true state, while A may contain it (since $A^c$ certainly does not contain it). 

3.3 Relation with the Classical Belief Model; The 
Confidence/Diffidence Model 

The classical belief model is embedded into our general framework. According to 
the interpretation given above, a classical mass allocation corresponds to a 
confidence mass $m^+$, without any diffidence component. Hence, the corresponding 
Möbius transform m is non-null only for elements of the form $(A, A^c)$ in $\mathcal{Q}(\Omega)$. 
The consequence is that, as claimed above, $Bel(A,B)$ no longer depends on B: 

$$Bel(A,B) = \sum_{(C,D) \sqsubseteq (A,B)} m(C,D) = \sum_{C \subseteq A} m(C, C^c).$$

Consequently, $Pl(A,B)$ does not depend on A. 

Symmetrically, we could imagine a situation where only diffidence mass is 
given, so that the corresponding Möbius transform is non-zero only for elements 
of the form $(\emptyset, B)$. In this case, 

$$Bel(A,B) = \sum_{C \supseteq B} m(\emptyset, C)$$

so that $Bel(A,B)$ does not depend on A (and $Pl(A,B)$ no longer depends on B). 

The confidence/diffidence model appears as a particular case of the bi-belief 
model that is easier to handle. Let us remark that in general, $m^+$ and $m^-$ are not usual 
mass allocations since: 

- $m^+(\emptyset) = 0$, and $m^+(\Omega)$ represents ignorance, but $\sum_{A \subseteq \Omega} m^+(A) \neq 1$ in 
general. 

- $m^-(\Omega) = m^+(\emptyset) = 0$, $m^-(\emptyset)$ represents ignorance (of the diffidence part), 
and again $\sum_{A \subseteq \Omega} m^-(A) \neq 1$ in general. 

However, note that 

$$\sum_{A \subseteq \Omega} m^+(A) + \sum_{A \subseteq \Omega} m^-(A) = 1.$$

This allows a free balance of the confidence and diffidence parts (which would 
have been impossible with bipolar bi-capacities), including the classical belief 
model, and a symmetric purely diffidence model. The quantities $\sum_{A \subseteq \Omega} m^+(A)$ 
and $\sum_{A \subseteq \Omega} m^-(A)$ represent the “weights” of the confidence and diffidence parts 
(if they come from two different sources, they may be simply the confidence 
degree we have in these sources). 

Combining both the confidence and the diffidence, we have 

$$m(A,B) = \begin{cases} m^+(A) & \text{if } B = A^c \\ m^-(B) & \text{if } A = \emptyset \\ 0 & \text{otherwise} \end{cases}$$

and 

$$Bel(A,B) = \sum_{C \subseteq A} m^+(C) + \sum_{C \supseteq B} m^-(C).$$

3.4 The Ukalvia Example Revisited 

Let us try to apply the confidence/diffidence model to the Ukalvia case, with 
different balances between the two parts of the model. 

The possible states of nature are: the economic situation is good (G), or it 
is bad (B), hence $\Omega = \{G, B\}$. The confidence part is given by the statement: 
“the agent believes to some extent that the economic situation is good”, which 
can be modeled by 

$$m^+(G) = \alpha, \quad m^+(B) = 0, \quad m^+(G) + m^+(\{G,B\}) = \alpha^+$$

and $\alpha^+ - \alpha = m^+(\{G,B\})$ represents ignorance of the confidence part. Now 
the diffidence part is expressed by the statement: “the agent does not trust (to 
some extent) the journal”, hence he/she does not believe to some extent that 
the situation is good. This can be modeled by 

$$m^-(G) = \beta, \quad m^-(B) = 0, \quad m^-(G) + m^-(\emptyset) = \beta^-$$

where again $\beta^- - \beta = m^-(\emptyset)$ represents ignorance. Figure 2 illustrates the 
situation (masses equal to 0 are not shown). We can compute explicitly all 
beliefs: 

$$Bel(G, \emptyset) = m^+(G) + m^-(G) + m^-(\emptyset) = \alpha + \beta^-$$
$$Bel(B, \emptyset) = m^-(G) + m^-(\emptyset) = \beta^-$$
$$Bel(G, B) = m^+(G) = \alpha$$
$$Bel(\emptyset, \emptyset) = m^-(\emptyset) + m^-(G) = \beta^-$$
$$Bel(B, G) = m^-(G) = \beta$$
$$Bel(\emptyset, B) = 0$$
$$Bel(\emptyset, G) = m^-(G) = \beta$$

Fig. 2. The Ukalvia example modeled by bi-belief functions 

It is interesting to give various values to the masses, so as to model typical 
situations, e.g. 

- situation 1: the agent has not heard about the origin of the journal: $\alpha = 0.8$, 
$\alpha^+ = 1$, $\beta = 0$, $\beta^- = 0$. 

- situation 2: the agent rather trusts the journal: $\alpha = 0.6$, $\alpha^+ = 0.7$, $\beta = 0.2$, 
$\beta^- = 0.3$. 

- situation 3: the agent feels confused (equal weight for confidence and 
diffidence): $\alpha = 0.4$, $\alpha^+ = 0.5$, $\beta = 0.4$, $\beta^- = 0.5$. 

- situation 4: the agent does not trust the journal so much: $\alpha = 0.2$, $\alpha^+ = 0.3$, 
$\beta = 0.6$, $\beta^- = 0.7$. 

- situation 5: the agent, knowing from the beginning the origin of the journal, 
does not trust the information: $\alpha = 0$, $\alpha^+ = 0$, $\beta = 0.8$, $\beta^- = 1$. 

We obtain the following results, which fit the intuition. 

situation      1      2      3      4      5 
Bel(G, ∅)      0.8    0.9    0.9    0.9    1 
Bel(B, ∅)      0      0.3    0.5    0.7    1 
Bel(G, B)      0.8    0.6    0.4    0.2    0 
Bel(∅, ∅)      0      0.3    0.5    0.7    1 
Bel(B, G)      0      0.2    0.4    0.6    0.8 
Bel(∅, B)      0      0      0      0      0 
Bel(∅, G)      0      0.2    0.4    0.6    0.8 

Note that, as the agent thinks more and more that it is propaganda, 
the belief for (G, B) (the true situation is G, and B is not the true situation) 
decreases, while the belief for (B, G) increases. 
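The table can be recomputed directly from the masses. The sketch below places the confidence mass $m^+(A)$ on the element $(A, A^c)$ and the diffidence mass $m^-(A)$ on $(\emptyset, A)$, and sums the masses of all elements lying below $(A, B)$ in the lattice order. The function names `ukalvia_mass` and `bel` are hypothetical.

```python
def ukalvia_mass(alpha, alpha_plus, beta, beta_minus):
    """Mass on Q({G, B}) built from the confidence and diffidence parts."""
    G, B = frozenset("G"), frozenset("B")
    empty, omega = frozenset(), frozenset("GB")
    return {
        (G, B): alpha,                       # m+(G)
        (omega, empty): alpha_plus - alpha,  # m+({G, B}), confidence ignorance
        (empty, G): beta,                    # m-(G)
        (empty, empty): beta_minus - beta,   # m-(empty set), diffidence ignorance
    }


def bel(mass, a, b):
    """Bel(A, B): sum of m(C, D) with C a subset of A and D a superset of B."""
    a, b = frozenset(a), frozenset(b)
    return sum(v for (c, d), v in mass.items() if c <= a and d >= b)


situations = [(0.8, 1.0, 0.0, 0.0), (0.6, 0.7, 0.2, 0.3), (0.4, 0.5, 0.4, 0.5),
              (0.2, 0.3, 0.6, 0.7), (0.0, 0.0, 0.8, 1.0)]
events = [("G", ""), ("B", ""), ("G", "B"), ("", ""), ("B", "G"),
          ("", "B"), ("", "G")]

for a, b in events:
    row = [round(bel(ukalvia_mass(*s), a, b), 10) for s in situations]
    print(f"Bel({a or '0'}, {b or '0'})", row)
```

Here an empty string stands for the empty set, so `bel(mass, "G", "")` is $Bel(G, \emptyset)$.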

4 Bi-possibility Measures 

As possibility measures play a particular role in capacity theory, we try to define 
here the corresponding concept for bi-capacities. It is well known that necessity 
and possibility measures are particular cases of belief and plausibility functions, 
when focal elements are nested. 

Results are relatively easy to generalize when everything is expressed in the 
language of ordered sets, and in particular lattices. Nested focal elements in 
fact form a maximal chain from bottom to top of the lattice, so we keep the same 
notion for $\mathcal{Q}(\Omega)$. A maximal chain in $\mathcal{Q}(\Omega)$ is a sequence of elements starting 
from $(\emptyset, \Omega)$ and finishing at $(\Omega, \emptyset)$, where between two consecutive elements either 
i is deleted from the right argument, or j is added to the left argument, provided 
j is not present in the right argument. 

We say that a bi-belief function is a bi-necessity measure if the focal elements 
form a maximal chain in $\mathcal{Q}(\Omega)$. A bi-possibility measure is the conjugate of 
a bi-necessity measure. We denote by $\Pi$ and $N$ bi-possibility and bi-necessity 
measures. 

The following result holds. 

Proposition 2. Let $\Pi$ and $N$ be any bi-possibility and bi-necessity measure. 
Then for any $(A,B)$, $(C,D)$ in $\mathcal{Q}(\Omega)$, 

$$\Pi((A,B) \sqcup (C,D)) = \Pi(A,B) \vee \Pi(C,D)$$

$$N((A,B) \sqcap (C,D)) = N(A,B) \wedge N(C,D).$$

The proof is based on a general result by Barthélemy [1]. Observe that these 
properties extend those known for the classical possibility and necessity 
measures. 

Since any element in Q(l7) can be expressed with the help of sup-irreducible 
or inf-irreducible elements by equations (1) and (2), we can write: 

n{A,B)=\! TT+{i)y \J 7r"(j) 

N(A,B) = /\ (l-7r+(t)) A f\ (l-7r-(j)) 

= A ^ A 

ieA” j^B 

putting 7T+(i) := 7T(i,i°), 7r“(z) := II{il),i‘^), and n+(z) := N(z°,0), n~{i) := 
N(z°,z). These functions could be called distributions as in the classical case. 
More precisely, the pair (7r+,7r“) is the bi-possibility distribution (idem for bi- 
necessity distribution). The name “distribution” is justified since the value of 
7T or N at any point of Q(l7) can be recovered from the distribution. We call 
TT'*', 7T“ the left and right distribution functions respectively (idem for necessity). 

Let us express the relation between the distribution and the Mobius trans- 
form. We assume first that the Mobius transform is given, i.e. we know the 
maximal chain C and values of m on it. Any maximal chain is a sequence of 
2n-|- 1 elements starting with (0, 17) and finishing with (17, 0). A convenient way 
of denoting a chain is to write the 2n length sequence of elements of 17 which 
are deleted or added, with a “-I-” sign if added, and a ” sign if deleted. For 
example for rz = 3, the maximal chain 



( 10 ) 

( 11 ) 

( 12 ) 



{(0, 123), (0, 12), (0, 1), (2, 1), (2, 0), (12, 0), (123, 0)} 
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is denoted — 3, — 2, 2, — 1, 1, 3. By construction, each index appears only once, 
with each sign, and a positive index cannot be before the corresponding neg- 
ative index. This sequence is denoted ac, and is a permutation on {—n,—n + 
1,... ,-1,1,2,... ,n}. 

Let us compute the possibility distribution for a given Möbius transform
$m$ living on a chain $\mathcal{C}$. Let $\sigma^+$, $\sigma^-$ denote the permutations on $\Omega$ defining
the ordering of the positive and negative indices in $\mathcal{C}$. With these notations, the
following holds.

Proposition 3. For a given Möbius transform $m$ living on a chain $\mathcal{C}$, the
possibility distribution $(\pi^+, \pi^-)$ is given by:

$$\pi^+(\sigma^-(k)) = 1 - \sum_{(C,D) \in \mathcal{C} \cap [(\emptyset,\Omega),\, (\cdot,\, \Omega \setminus \{\sigma^-(1), \ldots, \sigma^-(k)\})[} m(C,D)$$

$$\pi^-(\sigma^+(k)) = 1 - \sum_{(C,D) \in \mathcal{C} \cap [(\emptyset,\Omega),\, (\{\sigma^+(1), \ldots, \sigma^+(k)\},\, \cdot)[} m(C,D)$$

The sequences $\pi^+$ and $\pi^-$ are ordered w.r.t. $\sigma^-$ and $\sigma^+$ respectively, and moreover
the whole sequence is ordered w.r.t. $\sigma_{\mathcal{C}}$, specifically:

$$1 = \pi^+(\sigma_{\mathcal{C}}(-n)) \geq \cdots \geq \pi^-(\sigma_{\mathcal{C}}(n)) = m(\Omega, \emptyset).$$



Applying Proposition 3 to our example above, we have:

$\pi^+(3) = 1$

$\pi^+(2) = 1 - m(\emptyset, 12)$

$\pi^+(1) = 1 - m(\emptyset, 12) - m(\emptyset, 1) - m(2, 1)$

$\pi^-(2) = 1 - m(\emptyset, 12) - m(\emptyset, 1)$

$\pi^-(1) = 1 - m(\emptyset, 12) - m(\emptyset, 1) - m(2, 1) - m(2, \emptyset)$

$\pi^-(3) = 1 - m(\emptyset, 12) - m(\emptyset, 1) - m(2, 1) - m(2, \emptyset) - m(12, \emptyset)$,

and $1 = \pi^+(3) \geq \pi^+(2) \geq \pi^-(2) \geq \pi^+(1) \geq \pi^-(1) \geq \pi^-(3)$. Since the equation
system is triangular, one easily recovers $m$ from a given distribution, the chain
of focal elements being defined by the distribution.
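
As a hedged illustration of the triangular system, the sketch below accumulates the masses along the chain and reads off the distribution. The mass values are invented for the illustration, the function name `distribution_from_moebius` is hypothetical, and the bottom element is assumed to carry no mass, as the example (where $\pi^+(3) = 1$) suggests.

```python
from fractions import Fraction as F


def distribution_from_moebius(chain, m):
    """Return (pi_plus, pi_minus) for a Moebius transform m living on a chain.

    At each step of the chain, the value is 1 minus the total mass of all
    chain elements strictly before the element reached by that step.
    A deletion of index i defines pi_plus[i]; an addition defines pi_minus[i].
    """
    pi_plus, pi_minus = {}, {}
    cumulated = F(0)
    for (a0, b0), (a1, b1) in zip(chain, chain[1:]):
        cumulated += m.get((a0, b0), F(0))
        value = 1 - cumulated
        if a1 != a0:
            (i,) = a1 - a0      # index added to the first component
            pi_minus[i] = value
        else:
            (i,) = b0 - b1      # index deleted from the second component
            pi_plus[i] = value
    return pi_plus, pi_minus


def fs(*items):
    return frozenset(items)


chain = [
    (fs(), fs(1, 2, 3)), (fs(), fs(1, 2)), (fs(), fs(1)),
    (fs(2), fs(1)), (fs(2), fs()), (fs(1, 2), fs()), (fs(1, 2, 3), fs()),
]
# Illustrative masses (assumed values summing to 1).
m = {
    chain[1]: F(1, 10), chain[2]: F(2, 10), chain[3]: F(1, 10),
    chain[4]: F(2, 10), chain[5]: F(1, 10), chain[6]: F(3, 10),
}
pi_plus, pi_minus = distribution_from_moebius(chain, m)
print(pi_plus, pi_minus)
```

With these masses the last value, $\pi^-(3)$, equals the mass of the top element, in agreement with the ordering stated in Proposition 3.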
Abstract. The main question addressed in this paper is how to
represent belief function independencies by a graphical model. Directed
evidential networks (DEVNs) with conditional belief functions are
proposed for this purpose. These networks are directed acyclic graphs (DAGs) similar
to Bayesian networks, but instead of using probability functions, we
use belief functions. A directed evidential network with conditional belief
functions has the advantage of providing an appropriate representation
of knowledge that is naturally expressed as conditional relationships.

Keywords: belief functions, transferable belief model, conditional 
belief functions, directed evidential networks 



1 Introduction 

In this paper, we discuss some aspects related to uncertainty¹ representation in
directed evidential networks with conditional belief functions. Pearl starts with
conditional independence relationships when building his probabilistic graphical
model, where conditional probabilities can be directly manipulated using Bayes’
theorem (Pearl, 1988). The graphical model is considered as a picture that
provides an intuitive description of the problem. It is also considered as a
mathematical structure that specifies the different connections between the variables of a
problem, transforming a complex problem into an easy and clear representation.

However, in networks using belief functions, the relations among the
variables are generally represented by joint belief functions (Shafer et al., 1987)
rather than conditional belief functions. Nevertheless, the use of graphs to
represent conditional independence relations is useful since an exponential number
of conditional independence statements can be represented by a graph with a
polynomial number of vertices (Shenoy, 1993). So, in this paper, we are interested
in showing how to represent independencies by directed evidential
networks with conditional belief functions.

¹ Uncertainty is expressed by belief functions as understood in the context of the
transferable belief model (TBM).
The remainder of this paper is organized as follows. In Sect. 2, we present
some useful definitions and notations needed in the belief function context. Next,
before presenting and discussing belief function networks, we focus on the
study of the directed acyclic graph as a general graphical model that displays
dependence relationships qualitatively, under any assignment of numerical values
(Sect. 3), and then briefly present some well-known graphical representations
(Sect. 4), namely Bayesian networks (BN), valuation networks (VN), and
evidential networks with conditional belief functions (ENC). These types of networks
seem to be a good starting point for our work. Then, Sect. 5 is devoted to presenting
the directed evidential networks (DEVN) with conditional belief functions. We
especially discuss the links between the directed evidential networks and the
other networks (Bayesian networks and valuation networks).

2 Definitions and Notations of Belief Functions 

When we model aspects of the real world, we often deal with multivariate
situations where the state space is a product space. Therefore, multivariate belief
function theory turns out to be well suited for modelling real-world problems.
We present below some definitions necessary when belief functions are used.

2.1 Variables 

Let $U = \{X, Y, Z, \ldots\}$ be a set of finite variables, $\Theta_X = \{x_1, \ldots, x_n\}$ be the
domain relative to the variable $X$ (with a finite cardinality $n$), and $x$ represents
any instance of $X$. For simplicity's sake, we denote $\Theta_X$ by $X$, $\Theta_Y$ by $Y$ ... Let
$\Omega$ be a frame of discernment (Shafer, 1976). It is the Cartesian product of the
domains of the variables in $U$. For a subset of variables $A \subseteq U$, the frame for $A$,
denoted by $\Theta_A$, is the Cartesian product of the frames for the variables in $A$, and
its elements are called the configurations of $A$. For example, $X \times Y$ represents
the product space of variables $X$ and $Y$, and when there is no ambiguity, it is
simply denoted by $XY$. The elements of $X$ ($Y$ ...) are represented by indexed
variables like $x_i$ ($y_j$ ...) whereas $x$ ($y$ ...) denote subsets of $X$ ($Y$ ...). For
$x \subseteq X$ and $y \subseteq Y$, $(x, y)$ is defined by $(x, y) = \{(x_i, y_j) : x_i \in x, y_j \in y\}$,
and similarly for $(x, y, z)$ ... . Extension and projection of sets of configurations
are very important in belief function theory. Therefore, let us define these two
operations:

Definition 1. Cylindrical Extension. For $x \subseteq X$, $x^{\uparrow XY}$ is the cylindrical
extension of $x$ on $XY$: $x^{\uparrow XY} = (x, Y)$.

Definition 2. Projection. For $w \subseteq \Omega$, $w^{\downarrow X}$ is the projection of $w$ on $X$:
$w^{\downarrow X} = \{x_i : x_i \in X,\ x_i^{\uparrow \Omega} \cap w \neq \emptyset\}$.

We now give the definition of independent variables.

Definition 3. Independent Variables. Two variables $X$ and $Y$ are independent
iff $(x_i, y_j) \neq \emptyset$, $\forall x_i \in X, y_j \in Y$.
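
The two set operations can be sketched in a few lines for a two-variable frame $XY$. Configurations are encoded here as `(x_i, y_j)` tuples; this encoding and the function names are assumptions of the sketch, not notation from the paper.

```python
from itertools import product


def cylindrical_extension(x, Y):
    """Extend a subset x of X to XY: every element of x paired with all of Y."""
    return {(xi, yj) for xi, yj in product(x, Y)}


def projection_on_X(w):
    """Project a set of (x_i, y_j) configurations on X.

    x_i belongs to the projection iff its cylindrical extension meets w,
    i.e. iff x_i occurs as the first coordinate of some element of w.
    """
    return {xi for (xi, _) in w}


X = {"x1", "x2"}
Y = {"y1", "y2"}

extended = cylindrical_extension({"x1"}, Y)
projected = projection_on_X({("x1", "y2"), ("x2", "y2")})
print(extended, projected)
```

Projecting a cylindrical extension back on $X$ returns the original subset, which is a quick sanity check on both definitions.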
2.2 Belief Function Conditioning

Let X denote the background knowledge that holds and that underlies the beliefs.

In X, we find the classical conditioning events.

– We use the notation $m^\Omega[x]$ to represent the bba (shorthand for basic belief
assignment) $m$ defined on the domain $\Omega$ given that the belief holder knows
(accepts) that $x$ is true (i.e. $x$ holds). The term $m$ can be replaced by $bel$, $pl$, $q$
in order to denote the belief function, the plausibility function and the
commonality function. The values taken by these functions at $w \subseteq \Omega$ are
denoted by $m^\Omega[x](w)$, $bel^\Omega[x](w)$, $pl^\Omega[x](w)$, $q^\Omega[x](w)$, respectively. $m^\Omega[x](w)$
is called a basic belief mass (bbm). $bel^\Omega[x]$ is called a conditional belief
function. It can be seen as a vector in a $2^{|\Omega|}$ dimensional space. Classically, it
was denoted as $bel^\Omega(\cdot \mid x)$, but the bracket notation turns out to be more
convenient. The $\Omega$ superscript will not be mentioned when there is no risk
of confusion.

– Given a belief function $bel^\Omega$ on $\Omega$, let

$b^\Omega(w) = bel^\Omega(w) + m^\Omega(\emptyset), \quad \forall w \subseteq \Omega.$

This function $b^\Omega$ is called the implicability function. In practice, the $b$
function is much more convenient to work with than the $bel$ function.

– $bel^\Omega[x](w)$ denotes the value of $bel^\Omega[x]$ at $w \subseteq \Omega$. It represents the belief of
$w$ given $x$. When $x$ is the proposition that states that the actual value of $\Omega$
belongs to $y \subseteq \Omega$, its value is given by

$b^\Omega[y](w) = b^\Omega(w \cup \bar{y})$,
$bel^\Omega[y](w) = bel^\Omega(w \cup \bar{y}) - bel^\Omega(\bar{y})$,
$pl^\Omega[y](w) = pl^\Omega(w \cap y)$,
$q^\Omega[y](w) = q^\Omega(w)$ if $w \subseteq y$, and $0$ otherwise.

These are the so-called Dempster’s rules of conditioning (except for the
normalization factor).
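
The unnormalized conditioning can also be written at the level of masses: each focal element transfers its mass to its intersection with $y$. The sketch below uses that standard mass-level formulation, which the section itself does not spell out, and checks it against the $bel$ and $pl$ formulas above. The dictionary encoding of a bba and all function names are assumptions of the sketch.

```python
from fractions import Fraction as F


def condition(m, y):
    """Unnormalized conditioning: move the mass of each focal set v to v & y."""
    out = {}
    for focal, mass in m.items():
        key = focal & y
        out[key] = out.get(key, F(0)) + mass
    return out


def bel(m, w):
    """Belief of w: total mass of the non-empty focal sets included in w."""
    return sum((mass for focal, mass in m.items() if focal and focal <= w), F(0))


def pl(m, w):
    """Plausibility of w: total mass of the focal sets intersecting w."""
    return sum((mass for focal, mass in m.items() if focal & w), F(0))


omega = frozenset("abc")
m = {frozenset("a"): F(1, 2), frozenset("bc"): F(1, 4), omega: F(1, 4)}
y = frozenset("ab")
y_bar = omega - y

m_cond = condition(m, y)
w = frozenset("b")
print(bel(m_cond, w), bel(m, w | y_bar) - bel(m, y_bar))
```

On this small frame the conditioned masses give the same belief and plausibility values as the closed-form expressions of the section.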

2.3 Combination 

The $\oplus$ symbol represents Dempster’s rule of combination in its normalized form
and ∩⃝ represents the conjunctive combination, i.e., the same operation as
Dempster’s rule of combination except that the normalization is not performed.

The conjunctive combination rule (as well as its Dempster form) is applicable
to combine the belief functions produced by distinct pieces of evidence and
can be written equivalently as:

$$m_{1 ∩⃝ 2}(w) = (m_1 ∩⃝ m_2)(w) = \sum_{w_1, w_2 \subseteq \Omega,\ w_1 \cap w_2 = w} m_1(w_1)\, m_2(w_2)$$

$pl_1 ∩⃝ pl_2$ is the plausibility function obtained from $m_1 ∩⃝ m_2$, where $m_1$ and $m_2$
are the bba’s related to $pl_1$ and $pl_2$, respectively (and similarly with $bel$ and $q$).
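
A minimal sketch of the two rules, assuming bba's are stored as dictionaries from frozensets to masses (an encoding chosen for this illustration):

```python
from fractions import Fraction as F


def conjunctive(m1, m2):
    """Conjunctive combination: products of masses go to the intersections."""
    out = {}
    for w1, a in m1.items():
        for w2, b in m2.items():
            key = w1 & w2
            out[key] = out.get(key, F(0)) + a * b
    return out


def dempster(m1, m2):
    """Dempster's rule: conjunctive combination followed by normalization."""
    raw = conjunctive(m1, m2)
    conflict = raw.pop(frozenset(), F(0))
    if conflict == 1:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {w: mass / (1 - conflict) for w, mass in raw.items()}


omega = frozenset("ab")
m1 = {frozenset("a"): F(3, 5), omega: F(2, 5)}
m2 = {frozenset("b"): F(1, 2), omega: F(1, 2)}

unnormalized = conjunctive(m1, m2)
normalized = dempster(m1, m2)
print(unnormalized[frozenset()], normalized)
```

The only difference between the two results is the mass on the empty set, which the conjunctive rule keeps and Dempster's rule redistributes.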
2.4 Marginalization

The belief function $bel^{XY \downarrow X}$ is the marginal of $bel^{XY}$ on $X$. In particular, we have:

$bel^{XY \downarrow X}(x) = bel^{XY}(x, Y)$,

$pl^{XY \downarrow X}(x) = pl^{XY}(x, Y)$.

Note that conditioning and marginalization do not commute, so the order of
the symbols is important. $bel^{XY}[y]^{\downarrow X}$ is the belief function obtained by
conditioning $bel^{XY}$ on $y$ and then marginalizing the result on $X$.

2.5 Ballooning Extension 

Let $X$ and $Y$ be two independent variables, and let $bel^X[y_i]$ be a conditional belief
function defined on $X$ for $y_i \in Y$. The ballooning extension of the conditional
belief function, denoted $bel^X[y_i]^{\Uparrow XY}$, is the belief function defined on $XY$ whose bba
satisfies (Smets, 1978):

$$m^X[y_i]^{\Uparrow XY}(w) = \begin{cases} m^X[y_i](x) & \text{if } w = (x, y_i) \cup (X, \bar{y}_i), \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

The belief function obtained by ballooning extension is the least committed
belief function among all the belief functions defined on $XY$ whose conditioning
on $(X, y_i)$ (followed by the trivial marginalization on $X$ in order to be defined
on $X$ and not on $XY$) reproduces $bel^X[y_i]$ (Shafer, 1982).
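
Relation (1) can be illustrated as follows. The sketch balloons a conditional bba given $y_1$ to $XY$ and then checks that conditioning on $(X, y_1)$ and projecting on $X$ gives the original bba back. The mass values, the tuple encoding of configurations and the function names are assumptions made for the illustration.

```python
from fractions import Fraction as F


def ballooning_extension(m_cond, yi, X, Y):
    """Map each focal set x of m^X[yi] to (x, yi) united with (X, not-yi)."""
    others = frozenset((x, y) for x in X for y in Y if y != yi)
    return {frozenset((x, yi) for x in focal) | others: mass
            for focal, mass in m_cond.items()}


def condition_and_project(m_joint, yi, X):
    """Condition a bba on XY on (X, yi), then project the focal sets on X."""
    slice_yi = frozenset((x, yi) for x in X)
    out = {}
    for focal, mass in m_joint.items():
        key = frozenset(x for (x, _) in focal & slice_yi)
        out[key] = out.get(key, F(0)) + mass
    return out


X = {"x1", "x2"}
Y = {"y1", "y2"}
m_x_given_y1 = {frozenset({"x1"}): F(7, 10), frozenset(X): F(3, 10)}

m_xy = ballooning_extension(m_x_given_y1, "y1", X, Y)
recovered = condition_and_project(m_xy, "y1", X)
print(m_xy, recovered)
```

The mass 7/10 ends up on a set that keeps every configuration with $y_2$, which reflects that the conditional bba says nothing about what holds when $Y \neq y_1$.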

3 Conditional Independence in Directed Acyclic Graphs 

Developing a graphical representation for judgments about independencies
facilitates a qualitative organization of knowledge. Geiger et al. have investigated the
problem of determining exactly what independencies are implied by the structure
of the DAG in a causal network² (Geiger et al., 1990).

In this section, we give the definition of the graphical criterion identifying 
conditional independencies in DAGs, called d-separation. First, let’s look at 
the preliminary definitions which are useful for further interpretations. 

3.1 Dependency Models 

The notion of dependency models presented here originated with
(Pearl and Paz, 1987). Formally, a dependency model M over a finite set of
elements V is defined as any subset of triplets (X, Z, Y) where X, Y, and Z are
three disjoint subsets of V. The triplets in M represent independencies, that is,
(X, Z, Y) ∈ M asserts that “X is independent of Y given Z”. This statement is
called an independence statement and is written as I(X, Z, Y) (Pearl, 1988).

² A causal network consists of a set of variables and a set of directed
links between variables. Mathematically, the structure is a directed graph
(Jensen and Lauritzen, 2000).

The intended interpretation of I{X, Z, Y) is that when we observe Z, no ad- 
ditional information about X could be obtained by also observing Y, or equiv- 
alently X and Y interact only via Z. In a belief network, the independence 
statements are important because they reduce the complexity of inference. 

A graphical representation of a dependency model M is a one-to-one corre- 
spondence between the elements in M and the set of vertices in a graph G. 

3.2 d-Separation Criterion 

The study of the concept of conditional independence in probability theory has 
resulted in the identification of several properties that may be reasonable to 
demand of any relationship which attempts to capture the intuitive notion of 
independence. These properties are called graphoid axioms (i.e. symmetry, de- 
composition, weak union, contraction, and intersection axioms). 

Interestingly, directed acyclic graphs conform to the graphoid axioms if we
relate the independence statement I(X, Z, Y) in M with the graphical condition
“every chain from X to Y is blocked by the set of nodes Z” (Pearl, 1988). From
this, we can define the d-separation criterion.

Definition 4. d-Separation. Let G = (V, E) be a DAG. Two subsets of nodes,
X and Y, are said to be d-separated by Z, denoted by ⟨X | Z | Y⟩_G, if all
chains between the nodes in X and the nodes in Y are blocked by Z.

Intuitively, d-separation reflects a basic property of sound human reasoning, 
and therefore any uncertainty computation in causal networks should satisfy the 
principle that whenever X and Y are d-separated then new information about 
one of them does not change the certainty of the other when the Z variables are 
fixed. 
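
The section does not spell out when a chain is “blocked”, so the sketch below does not enumerate chains. It uses instead the well-known equivalent test on the moralized ancestral graph: X and Y are d-separated by Z iff Z separates them in the moral graph of the smallest ancestral set containing X, Y and Z. This is a swapped-in technique, and the function name and the `parents` dictionary encoding are assumptions of the sketch.

```python
def d_separated(parents, xs, ys, zs):
    """Test d-separation of disjoint node sets xs and ys given zs in a DAG.

    parents maps each node to an iterable of its parents
    (nodes without an entry are treated as roots).
    """
    # 1. Smallest ancestral set containing xs, ys and zs.
    relevant, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moral graph: link each node to its parents, marry the parents, drop directions.
    nbrs = {v: set() for v in relevant}
    for v in relevant:
        ps = list(parents.get(v, ()))
        for i, p in enumerate(ps):
            nbrs[v].add(p)
            nbrs[p].add(v)
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # 3. Undirected reachability from xs that never enters zs.
    seen, stack = set(xs), list(xs)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        for u in nbrs[v]:
            if u not in seen and u not in zs:
                seen.add(u)
                stack.append(u)
    return True


# Hypothetical DAG: X -> Z <- Y, Z -> W.
parents = {"Z": ["X", "Y"], "W": ["Z"]}
print(d_separated(parents, {"X"}, {"Y"}, set()))
print(d_separated(parents, {"X"}, {"Y"}, {"Z"}))
print(d_separated(parents, {"X"}, {"W"}, {"Z"}))
```

In this small DAG, X and Y are separated when nothing is observed, become connected once their common child Z is observed, and X is separated from W by Z.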

4 Graphical Representations 

We first consider two well-known frameworks for graphical representation of
uncertain knowledge: Bayesian networks (Pearl, 1988) and valuation networks
(Shenoy and Shafer, 1990). Bayesian networks are used for probabilistic
inference, while valuation networks represent several uncertainty formalisms in a
unified framework. Then, since it is not possible to represent knowledge in
conditional form in valuation networks, (Xu, 1995) proposes a new network for
belief function inference called evidential network with conditional belief functions
(ENC).

4.1 Bayesian Networks 

We focus on the possibility of using DAGs as graphical representation for
probabilistic models. Indeed, Pearl defines a Bayesian network as a directed acyclic
graph (DAG) in which the nodes represent (random) variables, the arcs signify
the existence of direct causal influences between the linked variables, and the
strengths of these influences are expressed by conditional probabilities. Formally,
a Bayesian network can be regarded as a triplet (V, E, P) where:

– $V = \{X_1, \ldots, X_n\}$ is a set of variables; with each variable $X_i$ we associate a
frame $\Theta_{X_i}$ representing the set of all its possible instances,

– E is a set of arcs over V such that the pair (V, E) is a DAG where the nodes
represent the variables and the arcs represent conditional dependency relations
among the variables,

– $P = \{P(X_i \mid X_{i_1}, \ldots, X_{i_k}) : X_i \in V,\ X_{i_j} \in Pa(X_i)\}$ is a set of assessment
functions defining conditional probabilities of the variables given their
parents, stored at node $X_i$. When $X_i$ has no parents, $P(X_i)$ represents the prior
probability of $X_i$.

Probabilistic Chain Rule. A fundamental assumption of a Bayesian network
is that when we multiply the conditionals for each variable, we get a unique
joint probability distribution for all variables in the network that agrees with
the independencies represented by the network structure³. The joint probability
of a Bayesian network is given by the following chain rule:

Definition 5. Probabilistic Chain Rule. Let BN be a Bayesian network over
$V = \{X_1, \ldots, X_n\}$; then the joint probability distribution P(V) is the product of all
conditional probabilities specified in BN:

$$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid Pa(X_i)) \qquad (2)$$

where $Pa(X_i)$ is the parent set of $X_i$.
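
As a small worked instance of the chain rule (2), the sketch below multiplies the conditionals of a two-node network A → B. The network, its probability tables and the function name are invented for the illustration.

```python
from fractions import Fraction as F
from itertools import product


def joint_probability(assignment, parents, cpt):
    """Chain rule: product over the variables of P(X_i = value | parent values)."""
    p = F(1)
    for var, value in assignment.items():
        parent_values = tuple(assignment[q] for q in parents[var])
        p *= cpt[var][parent_values][value]
    return p


# Hypothetical binary network A -> B.
parents = {"A": (), "B": ("A",)}
cpt = {
    "A": {(): {0: F(2, 5), 1: F(3, 5)}},
    "B": {(0,): {0: F(9, 10), 1: F(1, 10)},
          (1,): {0: F(1, 4), 1: F(3, 4)}},
}

p_11 = joint_probability({"A": 1, "B": 1}, parents, cpt)
total = sum(joint_probability({"A": a, "B": b}, parents, cpt)
            for a, b in product((0, 1), repeat=2))
print(p_11, total)
```

Only 2 + 4 table entries are specified here, and the products over all four configurations sum to one, as expected of a joint distribution.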



4.2 Valuation Networks 

Valuation networks are another well-known framework for the graphical
representation of uncertain knowledge. They are graphical depictions of valuation-based
systems (VBS) that can represent several uncertainty formalisms
(probability theory, possibility theory, belief function theory, ...) in a unified framework.
Valuation networks were originally proposed by (Shenoy and Shafer, 1990).
Formally, a valuation network can be regarded as a 3-tuple $\{\mathcal{X}, \{\Theta_{X_i}\}_{X_i \in \mathcal{X}},
\{V_1, \ldots, V_m\}\}$ with operators $\{\otimes, \downarrow\}$ where:

– $\mathcal{X}$ is a set of variables representing the universe of discourse,

– $\{\Theta_{X_i}\}$ is the set of frames associated with each variable $X_i$,

– $\{V_1, \ldots, V_m\}$ is a collection of valuations defined on the subsets of variables,

– $\otimes$ is the combination operation. Intuitively, combination corresponds to the
aggregation of knowledge,

– $\downarrow$ is the marginalization operation. Intuitively, marginalization corresponds
to the coarsening of knowledge.

³ Due to the independencies, far fewer probabilities need to be specified than with an
exhaustive list of the joint probability distribution (Pearl, 1988).

Graphically, there are two types of vertices in valuation networks (VNs) . One 
set of vertices represents variables, denoted by circles, and the other set repre- 
sents valuations, denoted by diamonds. In VNs, there are edges only between 
variables and valuations. There is an edge between a variable and a valuation if 
and only if the variable is in the domain of the valuation. 



Valuation Networks vs. Bayesian Networks. Valuation networks (VNs) are 
general graphical representations because they can represent several uncertainty 
formalisms in a unified framework, while Bayesian networks (BNs) have been 
proposed as graphical representations of probabilistic models. 

Graphically, a valuation network is a hypergraph and a Bayesian network is 
a directed acyclic graph. In a BN, arcs describe conditional dependence relations 
among the variables, whereas in a VN, such relations are represented by joint 
valuations on the product space of the involved variables. Finally, in VNs, two 
valuations can bear on the same variables, whereas in BNs, it is not allowed that 
two nodes are directly connected by two arcs. A comparison of the Bayesian 
network and the valuation network is given in Table 1. Notice that we can 
obtain a valuation network from a Bayesian network, but the reverse is usually 
not possible. 



Features                    Bayesian network            Valuation network
--------------------------  --------------------------  --------------------------
Graphical structure
1. Type of graph            Directed acyclic graph      Hypergraph
2. Definition of relations  Based on conditional        Joint form
                            independence
3. Nodes                    Random variables            Variables and valuations

Inference procedure
4. Type of uncertainty      Probabilistic               Several uncertainty
                                                        formalisms
5. Inference process        Quantitative, based on      Quantitative, based on
                            probability propagation     fusion algorithm

Table 1. A comparison of Bayesian networks and valuation networks



Valuation Networks and Conditional Independence. In this section, we
show how to represent conditional independence relations in VNs. Indeed, in
VNs, there are edges only between variables and valuations. If a valuation is a
conditional for X given Y, then a directed edge is drawn between the conditional
valuation and the variables in X; this edge points toward the variables in X.

In the belief function context, (Cano et al., 1993) and (Shenoy, 1994b) have
both defined the concept of a conditional belief function, but their definition
is different from our definition of a conditional belief function. To avoid
confusion with our definition, we change the name they use and call it a joint belief
function with a vacuous marginal:

Definition 6. (Cano et al., 1993)⁴ Let X and Y be two independent variables, and
let $bel$ be a belief function defined on the product space $\Theta_{X \cup Y}$. It is said that $bel$
is a joint belief function with a vacuous marginal over X if and only if $bel^{\downarrow X}$ is
a vacuous belief function over X.

Intuitively, the joint belief function with a vacuous marginal means that if $bel$
is a belief function on $\Theta_Y$ conditioned on $\Theta_X$, then it may give some information
about variable Y and its relationships with variable X, but no information
about X. Xu has shown that this property can be easily verified when the belief
is represented in normalized conditional form (Xu, 1995).

Following (Shenoy, 1994b), valuation networks explicitly depict a factoriza- 
tion of the joint valuation. Since there is a one-to-one correspondence between 
a factorization of the joint valuation and the conditional independence relation 
that holds in it, valuation networks also explicitly represent conditional relations. 

Shenoy has also defined conditional independence in valuation networks and
shown that it satisfies the graphoid axioms (Shenoy, 1994a). Since valuation-based
systems (VBS) are a general framework, the graphoid axioms are also
satisfied by the conditional independence relations in all uncertainty formalisms that fit in
the VBS framework, including probability theory, belief function theory and
possibility theory. For further information, the reader is referred to (Shenoy, 1993;
Shenoy, 1994b; Ben Yaghlane et al., 2002; Ben Yaghlane, 2002).

4.3 Evidential Networks with Conditional Belief Functions 

Evidential networks⁵ with conditional belief functions, called ENC, were originally
proposed by Smets for the propagation of beliefs (Smets, 1993). These networks
were then studied in depth in (Xu, 1995). Graphically, the network is a
directed acyclic graph. However, conditional beliefs are defined in a different manner
from conditional probabilities in the Bayesian network (BN): each edge
represents a conditional relation between the two nodes it connects. For example, the
edges (X, Z) and (Y, Z) in Fig. 1 mean that we have $\{bel^Z[x_i] : x_i \in \Theta_X\}$ and
$\{bel^Z[y_j] : y_j \in \Theta_Y\}$, but not $\{bel^Z[x_i, y_j] : x_i \in \Theta_X, y_j \in \Theta_Y\}$ as in a BN.

However, if conditional beliefs such as $\{bel^Z[x_i, y_j] : x_i \in \Theta_X, y_j \in \Theta_Y\}$ are
given, Xu’s method builds an ENC in which nodes X and Y are merged as one
node. For any merged node, the belief function is obtained by combining the
ballooning extension⁶ of each conditional belief function.

Fig. 1. An evidential network with conditional belief functions

⁴ In (Cano et al., 1993; Shenoy, 1994b), this belief function is called “conditional belief
function” and, in (Xu, 1995), it is called “non-informative belief function”.

⁵ When the VBS is specialized in belief function theory, it is called an evidential
system, and the valuation network is the evidential network.

5 DAG Representation of Belief Functions Models: 
Directed Evidential Networks 

In the previous section, we presented Xu’s graphical representation (i.e.
ENC), which uses conditional belief functions for knowledge representation
and reasoning. By comparing the representation by joint
beliefs with the representation in conditional form, Xu and Smets have shown that the conditional
form takes less space. Indeed, in ENC, any computation involving two
connected variables X and Y is processed on the space $\Theta_X$ or $\Theta_Y$, while in the
network with joint beliefs, such computations are always done on the product
space $\Theta_{X \cup Y}$. Thus the computations in ENC need fewer set comparisons and
multiplications than those in the network with joint beliefs.

Nevertheless, the computation in ENC is not very efficient because the
reasoning process is based on the ballooning extension of conditional belief
functions. Furthermore, the representation and propagation algorithm proposed by
Xu are restricted to evidential networks which only have
binary relations between the nodes (Xu, 1995).

Thus, in this section, we generalize ENC to the case where relations may involve
any number of nodes. In order to distinguish the two kinds of networks, we call
ours DEVN, which stands for directed evidential network with conditional belief
functions.



5.1 Knowledge Representation Using Conditional Belief Functions 

In this section, we discuss the relationships between joint belief functions and 
conditional belief functions which represent the same knowledge. 

Example 1. Suppose that we have two variables X and Y with frames $\Theta_X =
\{x_1, x_2\}$ and $\Theta_Y = \{y_1, y_2\}$, respectively. To represent a relation between X and
Y such that “if $X = x_1$ then $Y = y_1$ with $m = 0.7$” by a belief function in:



⁶ See Sect. 2.5.




300 B. Yaghlane, P. Smets, and K. Mellouli 



joint form: the rule is represented by a belief function on the space $\Theta = \Theta_X \times
\Theta_Y = \{(x_1, y_1), (x_1, y_2), (x_2, y_1), (x_2, y_2)\}$, with masses: 0.7 on the subset
$\{(x_1, y_1), (x_2, y_1), (x_2, y_2)\}$, and 0.3 on $\Theta$.

conditional form: the rule is represented by the conditional bba:

$m[x_1](\{y_1\}) = 0.7$,
$m[x_1](\Theta_Y) = 0.3$,
$m[x_2](\Theta_Y) = 1$,
$m[\Theta_X](\Theta_Y) = 1$.


This example suggests that the conditional representation is
easier for users to provide and to understand. In general, to represent
conditional belief functions for Y given X, by a joint form, it needs 
elements in the worst case, while by a conditional form, it only needs 
elements in the worst case. 

But following (Xu and Smets, 1994), not all belief functions on $\Theta_{X \cup Y}$ admit
an equivalent representation by a set of conditional belief functions. Further, they
think that the users’ knowledge is encoded in the conditional form and that the
joint beliefs the users would provide are those based on the known conditional
form. In many situations, the users’ beliefs can be represented by the conditional
belief functions for Y given $x_i \in \Theta_X$. The conditional belief for Y given $x \subseteq \Theta_X$
is then derived from the disjunctive rule of combination (DRC). Example 1 is
such a case. In the worst case, it needs only | 0x | elements. 



The Joint bba Generated by the Set of Conditional bba. Let X and
Y be two independent variables, and let the set of conditional belief functions
$bel^X[y_i]$ be defined on X for each $y_i \in Y$. The conditional belief functions are
considered as produced by distinct pieces of evidence. In practice, this means
that the knowledge of the value of $bel^X[y_i]$ does not produce constraints on what
might be the values of $bel^X[y_j]$ for $j \neq i$.

We want to build a bba on XY such that its conditioning on $(X, y_i)$
reproduces $bel^X[y_i]$ when the conditional belief functions are generated by distinct
pieces of evidence. The solution, presented in (Smets, 1978; Smets, 1993), is
obtained as follows:

Each $bel^X[y_i]$ is extended by a ballooning extension on the frame XY by
relation (1) and the results are conjunctively combined. The value of the resulting
belief function is given in the following theorem.

Theorem 1. Let X and Y be two independent variables, and let the set of
conditional belief functions $bel^X[y_i]$ be defined on X for each $y_i \in Y$. Let $w \subseteq XY$.
For each $y_i \in Y$, let $x_i(w)$ be the projection on X of the elements of $w$
which intersect $(X, y_i)$: $x_i(w) = (w \cap (X, y_i))^{\downarrow X}$. Then the conjunctive
combination of the ballooning extensions of each conditional belief function on XY,
$m^{XY} = ∩⃝_{y_i \in Y}\, m^X[y_i]^{\Uparrow XY}$, admits the following representations:

$$m^{XY}(w) = \prod_{y_i \in Y} m^X[y_i](x_i(w)) \qquad (3)$$

$$b^{XY}(w) = \prod_{y_i \in Y} b^X[y_i](x_i(w)) \qquad (4)$$

$$pl^{XY}(w) = 1 - \prod_{y_i \in Y} \left(1 - pl^X[y_i](x_i(w))\right) \qquad (5)$$

$$q^{XY}(w) = \prod_{y_i \in Y} q^X[y_i](x_i(w)) \qquad (6)$$

The formulas of Theorem 1 are proved in (Smets, 1978; Smets, 1993).



This construction is linked to the concept of a joint belief function with a
vacuous marginal by the next lemma. The joint belief function built according
to Theorem 1 is a joint belief function with a vacuous marginal over Y once all
the conditional belief functions are normalized.

Lemma 1. Let $\{bel^X[y] : y \in \Theta_Y\}$ be a family of normalized conditional belief
functions for X given Y. Then $bel^{XY} = ∩⃝_{y \in \Theta_Y}\, bel^X[y]^{\Uparrow XY}$ is a joint belief
function with a vacuous marginal over Y.
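
The construction can be checked numerically on a small case. The sketch below balloons two normalized conditional bba's, combines them conjunctively, and then compares the result with the product form (3) on every subset of XY and computes the marginal on Y as in Lemma 1. The mass values, the tuple encoding of configurations and all names are assumptions of the sketch.

```python
from fractions import Fraction as F
from itertools import combinations

X = ("x1", "x2")
Y = ("y1", "y2")
# Illustrative normalized conditional bba's m^X[y_i].
cond = {
    "y1": {frozenset({"x1"}): F(7, 10), frozenset(X): F(3, 10)},
    "y2": {frozenset({"x2"}): F(1, 2), frozenset(X): F(1, 2)},
}


def balloon(m_cond, yi):
    """Ballooning extension of m^X[yi] on XY, following relation (1)."""
    others = frozenset((x, y) for x in X for y in Y if y != yi)
    return {frozenset((x, yi) for x in focal) | others: mass
            for focal, mass in m_cond.items()}


def conjunctive(m1, m2):
    out = {}
    for w1, a in m1.items():
        for w2, b in m2.items():
            out[w1 & w2] = out.get(w1 & w2, F(0)) + a * b
    return out


joint = None
for yi in Y:
    extended = balloon(cond[yi], yi)
    joint = extended if joint is None else conjunctive(joint, extended)


def product_formula(w):
    """Right-hand side of (3): product of m^X[y_i](x_i(w)) over y_i."""
    value = F(1)
    for yi in Y:
        xi_w = frozenset(x for (x, y) in w if y == yi)
        value *= cond[yi].get(xi_w, F(0))
    return value


cells = [(x, y) for x in X for y in Y]
all_subsets = [frozenset(c) for r in range(len(cells) + 1)
               for c in combinations(cells, r)]
theorem_holds = all(joint.get(w, F(0)) == product_formula(w) for w in all_subsets)

# Marginal of the joint bba on Y (Lemma 1 predicts a vacuous one).
marginal_on_Y = {}
for w, mass in joint.items():
    key = frozenset(y for (_, y) in w)
    marginal_on_Y[key] = marginal_on_Y.get(key, F(0)) + mass
print(theorem_holds, marginal_on_Y)
```

In this instance the whole marginal mass on Y sits on the frame $\Theta_Y$ itself, i.e. the joint bba carries no information about Y alone.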



5.2 Directed Evidential Network Model 

In the following, we relate belief functions with directed acyclic graphs and intro- 
duce the notion of directed evidential network with conditional belief functions 
(Ben Yaghlane and Mellouli, 2001). The directed evidential network (DEVN) 
model is represented by: 

1. A knowledge base : we can distinguish two levels: qualitative and quan- 
titative. At the qualitative level, we have a directed acyclic graph (DAG) 
in which the nodes represent variables and directed arcs describe the condi- 
tional dependence relations embedded in the model (i.e. the link between the 
variables). At the quantitative level, the dependence relations are expressed 
by conditional belief functions for each variable given its parents. 

2. Facts representing the new observations introduced in the network and that 
will be represented by belief functions allocated to some nodes. 

Directed evidential networks are similar to Bayesian networks but instead of 
using conditional probability functions, we use conditional belief functions. No- 
tice that directed evidential networks (i.e. belief function models) have a greater 
expressive power than probabilistic ones (i.e. Bayesian networks); however they 
are more complex, and often have higher computational cost (Almond, 1995). 

– Each variable X in the DEVN has a set of possible values, called its frame of
discernment, that consists of mutually exclusive and exhaustive values of the
variable. Parents of X are denoted by $Pa_X$.

– For each root node X ($Pa_X = \emptyset$), uncertainty is represented by an a priori
belief function $bel^X$ on X.

– For the other nodes (i.e. $Pa_X \neq \emptyset$), uncertainty is represented by a
conditional belief function $bel^X[Pa_X]$ on X given the value taken by its parents.

Fig. 2. A Bayesian network drawn as an evidential network

Belief Chain Rule. Given all the a priori and conditional belief functions, the
joint distribution relative to the set of variables $\{X_1, \ldots, X_N\}$ is computed using
the following belief chain rule:

So, corresponding to each variable $X_i$ in the directed evidential network, we
create a belief function $bel^{X_i}[Pa_{X_i}]$ over the frame $\Theta_{X_i \cup Pa_{X_i}}$.

5.3 Links with Bayesian Networks and Valuation Networks 

In a Bayesian network, there is only one incoming conditional probability
function attached to a node. So a Bayesian network is an evidential network in which
we first keep the diamonds, but there is only one arrow that enters a node
(we do not have Dempster’s rule of combination). See Fig. 2 for an example.

Note that the evidential network of Fig. 3 cannot be represented as a Bayesian
network, which reflects the fact that we have lost Dempster’s rule of combination.

We can change the meaning of the links and use the Bayesian approach, in
which case we obtain Fig. 4. We can profit from the fact that we have conditional
belief functions and a considerably simplified evidential network.

6 Conclusion 

There are two types of graphical models commonly in use: undirected graphs
and directed graphs. Directed graphs are more appropriate for representing
conditional relationships (Pearl, 1988), and in particular, conditional belief
functions, whereas Shafer-Shenoy (Shenoy and Shafer, 1990) used undirected graphs
as they work with joint belief functions. In many applications, conditional belief
functions provide a more natural representation of the knowledge, and are easier
to collect, hence the interest of directed evidential networks with conditional
belief functions.

Fig. 3. An evidential network

Fig. 4. Directed Evidential Network (DEVN)

Inference algorithms in knowledge-based systems with a directed evidential
network (DEVN) obtain their efficiency by making use of the independencies
represented in their network. This can be done by using two rules proposed
by Smets, called the disjunctive rule of combination (DRC) and the generalized
Bayesian theorem (GBT), which make it possible to use the conditional belief
functions directly for reasoning in directed evidential networks, avoiding the
computation of the joint belief function on the product space (Ben Yaghlane, 2002).
Abstract. Binary join trees have been a popular structure to compute the
impact of multiple belief functions initially assigned to nodes of trees or networks.
Shenoy has proposed two alternative methods to transform a qualitative
Markov tree into a binary tree. In this paper, we present an alternative
algorithm for transforming a qualitative Markov tree into a binary tree based on the
computational workload in nodes for an exact implementation of evidence
combination. A binary tree is then partitioned into clusters, with each cluster
being assigned to a processor in a parallel environment. These three types of
binary trees are examined to reveal their structural and computational differences.



1 Introduction 

The expensive computational cost of Dempster’s rule in DS theory has led to a stream of studies
on efficient implementations of the rule over the last two decades. The proposed
approaches have emphasized either exact implementations of the rule (e.g., [1, 6, 10, 11,
13, 17], etc.) or its approximations (e.g., [2, 3, 4, 15, 16], etc.), under the assumption
that evidence distributions follow certain structures.

Among all these approaches, the method of belief propagation in qualitative 
Markov trees has been popular. As proved in [11], with this method, the computational 
complexity is no longer exponential in the total number of variables but only in the 
size of the largest node in a tree, i.e. the node with the largest number of variables. 
The major technique supporting the method is local computation [14], which was 
initiated for propagating probabilities in Bayesian causal trees by Pearl [8]. Local 
computation refers to computation that involves only a small number of nodes in a 
large tree (or network). The basic idea of local computation is message passing among 
neighboring nodes in a qualitative Markov tree to compute marginals of the joint belief 
distribution without actually calculating the joint belief distribution. The sizes of 
nodes in a qualitative Markov tree determine how efficient the local computation can be. 
If a node has a large number of neighbors, even local computation can be very inefficient. 
To solve this problem, the concept of binary join trees (or simply binary trees) was 
proposed by Shenoy in [12], in which every non-leaf node performs at most one 
combination, and the corresponding algorithm was introduced. Using this algorithm, any 
qualitative Markov tree can first be transformed into a binary join tree on which local 
computation can be carried out. Subsequently, Shenoy improved this transformation 



T.D. Nielsen and N.L. Zhang (Eds.): ECSQARU 2003, LNAI 2711, pp. 306-318, 2003. 
© Springer- Verlag Berlin Heidelberg 2003 




Computational-Workload Based Binarization and Partition 






algorithm in [13]. The improved algorithm constructs a binary tree based on a one- 
step lookahead technique which looks for an optimal order of variables being deleted. 
Given a qualitative Markov tree, these two algorithms construct rather different join 
trees, with the second tree bearing little resemblance to the original Markov tree. 
Furthermore, the one-step lookahead algorithm adds many more nodes in the trans- 
formation procedure than the first algorithm. These extra nodes will increase the 
computational costs in terms of calculating marginals for them. 

In this paper, we propose a different method to transform a qualitative Markov 
tree into a binary tree, based on the number of combinations in each sub-tree. A 
binary tree derived in this way is almost balanced with respect to the workload of 
combining evidence. We then partition a binary tree into a set of clusters with the 
intention that each cluster will be assigned to a processor in a parallel processing 
environment. The appearance of our binary tree is similar to the tree obtained from 
Shenoy’s first approach, except that Shenoy’s tree may have more added nodes because 
it permits only one combination in each node. However, our study of these two similar 
types of binary trees shows that Shenoy’s tree requires less computation 
than ours since it eliminates some duplicate combinations. Our study of Shenoy’s 
second approach reveals that a binary tree obtained in this way (it also permits one 
combination per node) is very complex and adds much extra computation due to a 
large number of added nodes, compared with his first approach. 

Other algorithms that binarize a qualitative Markov tree into a binary tree include 
a straightforward binarization procedure for approximate computation in Bayesian 
networks [2] and a computational-workload-based algorithm for executing a 
parallel program using multiple processors [7]. Our algorithm is similar to the 
procedure in [2] with respect to the structure of the tree, that is, each node has at 
most two children. However, our algorithm is more comprehensive because it assesses the 
amount of computation in each sub-tree before merging two sub-trees. As a 
result, our binary tree is a balanced one while a tree from [2] can be extremely 
unbalanced (see the example in Section 3). If there are many processors available to 
process some nodes (sub-trees) in parallel [5], a balanced tree provides a good structure 
for partitioning into clusters so as to assign workloads to processors evenly. Our 
algorithm is also similar to the binarization procedure in [7] in the sense that the 
latter also considers workloads on sub-trees when merging two sub-trees (units of a 
parallel program). The difference between them is that our algorithm needs to consider 
the amount of computation carried out in added nodes (which may affect the total 
workload of a sub-tree with this added node as the root), whereas the algorithm in [7] 
does not involve computation in added nodes (only message passing). 

The rest of the paper is organized as follows. Section 2 introduces the basics of 
DS theory and the terminology for belief propagation in qualitative Markov trees. 
Section 3 provides the algorithm that transforms a qualitative Markov tree into a 
binary join tree and partitions a binary tree into clusters for parallel processing. 
Section 4 reviews the two binarization algorithms proposed by Shenoy, with examples. 
Section 5 provides a detailed analysis and comparison of our algorithm with those 
of Shenoy and concludes the paper. 
2 DS Theory and Qualitative Markov Trees 

2.1 Basics of DS Theory 

In the Dempster-Shafer theory of evidence (DS theory) [9], a piece of information is 
described as a mass function on a set of mutually exclusive and exhaustive elements, 
known as a frame of discernment (or simply a frame), denoted as $\Theta$. A mass function 
$m: 2^{\Theta} \to [0,1]$ represents the distribution of a unit of belief over a frame $\Theta$, 
satisfying the following two conditions: $m(\emptyset) = 0$ and $\sum_{A \subseteq \Theta} m(A) = 1$. 
A belief function over $\Theta$ is a function $Bel: 2^{\Theta} \to [0,1]$ satisfying 
$Bel(A) = \sum_{B \subseteq A} m(B)$. When several belief functions 
are obtained through distinct sources based on the same frame of discernment, a 
new belief function representing their consensus can be produced. Assume that 
$Bel_1$ and $Bel_2$ are two such belief functions on the same frame $\Theta$; their 
combined impact is calculated using Dempster’s rule of combination, 
$Bel = Bel_1 \oplus Bel_2$. The computational complexity of combining two belief functions 
over a frame is exponential in the size of the frame. 
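
As an illustration of the definitions above, the following is a minimal Python sketch of Dempster’s rule on mass functions. The representation (a dictionary from frozensets to masses), the function names `dempster_combine` and `belief`, and the numbers are assumptions made for this example only; they do not come from the paper.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dictionaries."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("the two mass functions are totally conflicting")
    # Normalise so that the empty set gets no mass and the masses sum to 1.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

def belief(m, a):
    """Bel(A): total mass of the focal sets contained in A."""
    return sum(w for s, w in m.items() if s <= a)

# Hypothetical frame and evidence, chosen only for illustration.
theta = frozenset({"a", "b", "c"})
m1 = {frozenset({"a"}): 0.6, theta: 0.4}
m2 = {frozenset({"b"}): 0.5, theta: 0.5}
m12 = dempster_combine(m1, m2)
```

The double loop visits every pair of focal sets, which is where the exponential cost comes from: a mass function may have up to $2^{|\Theta|}$ focal sets.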

2.2 Qualitative Markov Trees 

Qualitative Markov Trees: We use graph-oriented terminology and notation for 
qualitative Markov trees here. Let a pair (V, E) be a graph, with V a finite set of 
nodes (or variables) and E a set of unordered pairs of distinct nodes in V. A 
qualitative Markov tree is a graph which has no cycles and in which any variable 
contained in two nodes is also contained in every node on the path linking them. 
Elements in V are denoted using capital letters, such as A, B, S, and subsets of V 
are denoted with lower-case letters, such as x, y, z. A 
qualitative Markov tree can be derived either from a Bayesian network [2, 13] or 
from a diagnostic tree [11, 14], as shown in Fig. 1a and Fig. 1b respectively. 




Fig. 1. Two examples of qualitative Markov trees 






When a qualitative Markov tree is constructed from a diagnostic tree, the collection 
of all leaf nodes defines the overall frame of discernment represented by the root. 
Any non-leaf node, such as e, contains all the leaf nodes in the sub-tree with this node 
as the root. The corresponding frame for belief combination is 
$\{e, \neg e\} = \{A, B, C, \neg e\}$, whereas the frame for a node in Fig. 1a is the 
Cartesian product of its variable frames. 
In the rest of the paper, we use a qualitative Markov tree in the form of Fig. 1b. Our 
algorithm and discussions are equally applicable to a tree in the form of Fig. 1a. 
Variables and Configurations: Let x be a node in a qualitative Markov tree representing 
a set of variables and $\Theta_x$ be the frame corresponding to x. Elements of $\Theta_x$ are 
referred to as configurations of x, denoted by bold-faced lower-case letters, such as 
$\mathbf{g}$, $\mathbf{f}$, $\mathbf{h}$. 

Projection and Extension: Let g and h be two sets of variables, $h \subseteq g$, and 
$\mathbf{g}$ a configuration of g. The projection of $\mathbf{g}$ to $\Theta_h$, denoted by 
$\mathbf{g}^{\downarrow h}$, is a configuration of h. 
Let G be a non-empty subset of $\Theta_g$; the projection of G to h, denoted by 
$G^{\downarrow h}$, is obtained by 
$G^{\downarrow h} = \{\mathbf{g}^{\downarrow h} \mid \mathbf{g} \in G\}$. If g and h are two 
sets of variables, $h \subseteq g$, and H is a subset of $\Theta_h$, then the extension of H 
to g, denoted by $H^{\uparrow g}$, is $H \times \Theta_{g \setminus h}$. 

Marginalization: If m is a mass function on g, and $h \subseteq g$, $h \neq \emptyset$, the 
marginal of m on h, denoted by $m^{\downarrow h}$, is a mass function on h defined by 

$$m^{\downarrow h}(H) = \sum \{ m(G) \mid G \subseteq \Theta_g,\ G^{\downarrow h} = H \}.$$

On the other hand, if m is a mass function on h, and $h \subseteq g$, $h \neq \emptyset$, the 
marginal of m on g, denoted by $m^{\uparrow g}$, is a mass function on g defined by 

$$m^{\uparrow g}(G) = \sum \{ m(H) \mid H \subseteq \Theta_h,\ H^{\uparrow g} = G \}.$$
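
The projection, extension and marginalization operations can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: a configuration is represented as a frozenset of (variable, value) pairs, a subset of a frame as a frozenset of configurations, and the helper names (`project_config`, `project_set`, `extend_set`, `marginalize`, `cfg`) and the example frames are hypothetical.

```python
from itertools import product

def project_config(g, h):
    """Projection of a configuration (frozenset of (variable, value) pairs) to variables h."""
    return frozenset((v, val) for v, val in g if v in h)

def project_set(G, h):
    """Projection of a set of configurations G to the variables h."""
    return frozenset(project_config(g, h) for g in G)

def extend_set(H, g_vars, frames):
    """Extension of a non-empty set H of configurations to the larger variable set g_vars."""
    h_vars = {v for v, _ in next(iter(H))}
    extra = sorted(set(g_vars) - h_vars)
    combos = list(product(*(frames[v] for v in extra)))
    return frozenset(c | frozenset(zip(extra, combo)) for c in H for combo in combos)

def marginalize(m, h):
    """Marginal of a mass function m ({set of configurations: mass}) on variables h."""
    out = {}
    for G, w in m.items():
        H = project_set(G, h)
        out[H] = out.get(H, 0.0) + w
    return out

def cfg(**assignment):
    """Helper for building a configuration, e.g. cfg(X="x1", Y="y2")."""
    return frozenset(assignment.items())

# Hypothetical frames for two variables X and Y.
frames = {"X": ("x1", "x2"), "Y": ("y1", "y2")}
G = frozenset({cfg(X="x1", Y="y1"), cfg(X="x1", Y="y2")})
full = frozenset(cfg(X=x, Y=y) for x in frames["X"] for y in frames["Y"])
m = {G: 0.7, full: 0.3}
m_x = marginalize(m, {"X"})
```

Here `m_x` keeps the mass 0.7 on the set containing only X = x1 and moves the remaining 0.3 to the whole frame of X.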

Belief Propagation: Let (V, E) be a qualitative Markov tree on which a set of belief 
functions are assigned to its nodes. Given a node x, $V_x = \{i \mid (i,x) \in E\}$ denotes 
the set of neighbours of x, a set of nodes that are directly linked with x. $Bel_x$ 
represents the initial belief function assigned to node x. To propagate initial belief 
functions to obtain the final marginal on a designated node (containing a set of 
variables), the propagation scheme starts with the leaves of a qualitative Markov tree 
and moves step by step towards the targeted node. Each time, a node x sends a message 
$M^{x \to i}$, referring to the belief function sent by x to i, to each of its neighbors, 

$$M^{x \to i} = \Big( Bel_x \oplus \big( \oplus \{ M^{k \to x} \mid k \in V_x \setminus \{i\} \} \big) \Big)^{\downarrow i}. \qquad (1)$$

For a leaf node x with only one neighbor i, $M^{x \to i}$ is reduced to 
$M^{x \to i} = Bel_x^{\downarrow i}$. After the designated node y has received the messages 
from all of its neighbors, the marginal for y is obtained as 

$$Bel^{\downarrow y} = Bel_y \oplus \big( \oplus \{ M^{i \to y} \mid i \in V_y \} \big).$$

As stated in [13], a qualitative Markov tree can always be 
re-constructed as a rooted one. In this paper, we concentrate on rooted qualitative 
Markov trees. Let node r be the root of a Markov tree and x be a node. Let 
$Ch_x = \{k \mid k \in V_x,\ k \text{ is a child node of } x\}$ be the set of children of x, and 
$P_x = \{p\}$ be the parent of x. The 
belief propagation scheme can be carried out in two phases to calculate the combined 
beliefs on any node [11]: 

Phase I. Propagate Messages up the Tree: starting at leaf nodes, messages are 
sent up step by step. 

$$M^{x \to p} = \Big( Bel_x \oplus \big( \oplus \{ M^{k \to x} \mid k \in Ch_x \} \big) \Big)^{\downarrow p}.$$

The maximum number of belief functions accumulated in a non-leaf, non-root 
node in this phase is $1 + |Ch_x|$, if this node has an initial belief function and each of 
its child nodes sends a message to it. Therefore the number of combinations is 
$(1 + |Ch_x|) - 1$, i.e. $|Ch_x|$. For a leaf node, no combinations are involved. For the root 
node, there are at most $1 + |Ch_r|$ belief functions accumulated. Since the root will not 
send any messages up, Phase I stops here. After computing the marginal of the root, 
messages are then sent back down the tree. Therefore, we will count the total number of 
combinations in the root in the next phase. 

Phase II. Propagate Messages down the Tree: starting at the root node, messages 
are sent back down step by step. 

$$M^{x \to k} = \Big( Bel_x \oplus M^{p \to x} \oplus \big( \oplus \{ M^{j \to x} \mid j \in Ch_x \setminus \{k\} \} \big) \Big)^{\downarrow k}. \qquad (2)$$

For a non-root, non-leaf node x, the maximum number of belief functions accumulated 
for propagating down to its child node k is $1 + |P_x| + (|Ch_x| - 1)$, so the number of 
combinations is $|Ch_x|$ (with $|P_x| = 1$), if we have stored every $M^{j \to x}$ in Phase I. 
The maximum total number of combinations in x is $|Ch_x| \times |Ch_x|$. Its final marginal is 

$$Bel^{\downarrow x} = Bel_x \oplus M^{p \to x} \oplus \big( \oplus \{ M^{j \to x} \mid j \in Ch_x \} \big). \qquad (3)$$

If the marginal of the joint for x from Equation (2) is reserved before it is projected 
to node k, then it can be incorporated into Equation (3) to replace all the messages 
except $M^{k \to x}$, so that Equation (3) can be rewritten as the combination of this 
reserved joint with $M^{k \to x}$. Therefore, there is 
only one extra combination to obtain the final marginal for a node. 

Because a root has at most $1 + |Ch_r|$ belief functions, the maximum number of 
combinations for propagating a message down a branch is $|Ch_r| - 1$ (the message from 
the branch to which the message is being sent will not be combined with the rest). The 
maximum total number of combinations is $(|Ch_r| - 1) \times |Ch_r|$. The root needs one 
combination for its final marginal. A leaf node also needs one combination for its final 
marginal. The total number of combinations in a qualitative Markov tree is the sum of 
the numbers of combinations of all the nodes. 
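
The counting argument above can be sketched as a short Python function. The names `node_combinations` and `total_combinations`, the dictionary representation of the tree and the example tree are assumptions made for illustration; the counts assume that every node holds an initial belief function and that a final marginal is wanted for every node.

```python
def node_combinations(n_children, is_root):
    """Return (Phase I, Phase II, final marginal) combination counts for one node."""
    if n_children == 0:
        # A root that is also a leaf needs no combination at all.
        return (0, 0, 0) if is_root else (0, 0, 1)
    if is_root:
        # The root sends nothing up; each downward message needs |Ch| - 1 combinations.
        return (0, (n_children - 1) * n_children, 1)
    # Non-root, non-leaf node: |Ch| going up, |Ch| for each of the |Ch| downward messages.
    return (n_children, n_children * n_children, 1)

def total_combinations(tree, root):
    """Sum the counts over a rooted tree given as {node: [children]}."""
    total = 0
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        kids = tree.get(node, [])
        total += sum(node_combinations(len(kids), is_root))
        stack.extend((k, False) for k in kids)
    return total

# Hypothetical tree (not the tree of Fig. 2): r has children a, b, c; a has children d, e.
example = {"r": ["a", "b", "c"], "a": ["d", "e"]}
total = total_combinations(example, "r")
```

For this hypothetical tree the root contributes 7 combinations, node a contributes 7, and each of the four leaves contributes 1.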




Fig. 2. A rooted qualitative Markov tree with maximum number of combinations in each node 
when an initial belief function is assigned to each node and a final marginal is required for 
every node 

The label $(t_1, t_2, 1)$ indicates that in node x, there are $t_1$ combinations when x 
sends a message to its parent, there are $t_2$ combinations when it sends messages to all 
of its children, and there is one extra combination to obtain the final marginal for x. 
When $t_1$ or $t_2$ is zero, we have omitted it from the above graph. 
3 A Weight-Based Binarization Algorithm 

When binarizing a qualitative Markov tree, for each non-leaf node x with more than 
two children, we repeatedly merge the two children that carry the least amount of 
computation into a new node, until x has only two children left. 
Such a binary tree should have almost balanced workloads among its branches. 

Although a new affiliated node is added whenever two branches are merged, these 
newly created nodes will only calculate and store some intermediate results of 
combinations, and no computation is required to calculate their own marginals. In the 
algorithm below, comb(x) represents the total number of combinations in node x, and 
comb(T_x) is the total number of combinations in the sub-tree with x as the root. 



Algorithm: Binarization of a Qualitative Markov Tree (BQMT) 

Input: a qualitative Markov tree with a designated root r. 
x ← r. 

Procedure Binarization (x): 

1. If x is a leaf node and x = r Then comb(x) ← 0, comb(T_x) ← 0. Terminate the 
   Procedure. (The tree has only one level, the root is also a leaf.) 

2. If x is a leaf node and x ≠ r Then comb(x) ← 1, comb(T_x) ← 1. Terminate the 
   Procedure. (The tree has more than one level.) 

3. For each child node c_i ∈ Ch_x do Binarization (c_i). 

4. Sort Ch_x in ascending order, where Ch_x = {c_1, ..., c_n} satisfying 
   comb(T_ci) ≤ comb(T_cj) if c_j is after c_i in the ordered set Ch_x. 

5. l ← 1. 

6. While |Ch_x| > 2 do 

   6.1 Select c_1, c_2, the first two elements in Ch_x; 

   6.2 Create a new node x_l (x_l = (c_1 ∪ … when we use a tree in the form of Fig. 1a; 
       x_l = (c_1 ∪ c_2) when a tree is of the form Fig. 1b) to connect c_1 and c_2, replace 
       sub-trees T_c1 and T_c2 with the new sub-tree T_xl with x_l as the root; 

   6.3 comb(x_l) ← 3, comb(T_xl) ← comb(x_l) + comb(T_c1) + comb(T_c2); 

   6.4 Remove c_1 and c_2 from Ch_x, insert x_l into Ch_x in sorted order; 

   6.5 l = l + 1. 

7. If |Ch_x| = 1 Then 

   7.1 If x is the root Then comb(x) ← 1 Else comb(x) ← 3; 

   7.2 comb(T_x) ← comb(x) + comb(T_c1). 

   Else 

   7.3 If x is the root Then comb(x) ← 3 Else comb(x) ← 7; 

   7.4 comb(T_x) ← comb(x) + comb(T_c1) + comb(T_c2). 

Return (T): A binary tree with the same root 
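
The BQMT procedure can be sketched in Python as follows. This is an illustrative sketch rather than the authors’ implementation: the `Node` class, its field names, the naming of added nodes (`r1`, `r2`, ...) and the four-leaf example tree are assumptions. The contents of an added node (the union of the merged children) are not modelled; only the tree shape and the workload bookkeeping are.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    added: bool = False   # True for nodes created by a merge (step 6.2)
    comb: int = 0         # comb(x): combinations carried out in this node
    comb_tree: int = 0    # comb(T_x): combinations in the sub-tree rooted here

def binarize(x, is_root=True):
    """Binarize the tree rooted at x in place and fill in the workload fields."""
    if not x.children:                       # steps 1 and 2
        x.comb = 0 if is_root else 1
        x.comb_tree = x.comb
        return x
    for child in x.children:                 # step 3
        binarize(child, is_root=False)
    x.children.sort(key=lambda c: c.comb_tree)   # step 4 (stable sort)
    label = 1                                # step 5
    while len(x.children) > 2:               # step 6
        c1, c2 = x.children[0], x.children[1]
        merged = Node(f"{x.name}{label}", [c1, c2], added=True, comb=3)
        merged.comb_tree = merged.comb + c1.comb_tree + c2.comb_tree
        # Remove the two lightest children and re-insert the merged node in order.
        x.children = sorted(x.children[2:] + [merged], key=lambda c: c.comb_tree)
        label += 1
    if len(x.children) == 1:                 # step 7
        x.comb = 1 if is_root else 3
    else:
        x.comb = 3 if is_root else 7
    x.comb_tree = x.comb + sum(c.comb_tree for c in x.children)
    return x

# Hypothetical input: a root with four leaf children.
root = binarize(Node("r", [Node(n) for n in "abcd"]))
```

In this small example the four leaves are paired into two added nodes of equal workload, so the root ends up with two balanced branches.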



For each newly added node, the maximum number of belief functions accumulated 
in it is $|Ch_x|$ (it has no initial belief function) instead of $1 + |Ch_x|$, so the maximum 
number of combinations is (1, 2, 0), i.e. 3 in total. Applying this algorithm to the tree 
in Fig. 2, we get a balanced binarised tree as in Fig. 3, where bold-faced nodes are 
added nodes. 
However, if we do not merge the two branches with the lightest workloads each 
time during the binarization procedure, we could end up with a tree that is totally 
unbalanced. For example, an alternative binary tree from the qualitative Markov tree 
in Fig. 2 can be in the form shown in Fig. 4. This unbalanced tree will make 
parallel processing much less efficient if we were to use multiple processors to 
process each part simultaneously [5]. Given a multiple-processor environment, it is in 
general not possible to assign each node to a processor, because either 
there are fewer processors available than the total number of nodes or the 
communication cost between processors is too high compared with the calculation. 





Below is a clustering algorithm that partitions a binary tree into clusters and 
assigns each cluster to a processor. In [5], we have been testing the algorithm in a 
four-processor environment where the tree in Fig. 3 is partitioned into four clusters, 
as illustrated with shading. These four processors perform combinations simultaneously, 
starting from the leaves. Each sends its results to the processor that contains its 
parent for further combination before it calculates final marginals for its nodes. This 
algorithm also partitions the tree in Fig. 4 into four clusters as shown. However, the 
processor that contains node (b,c) has to wait for the results of the other three 
processors, which deal with clusters from its right branch, before it can go further. 
Therefore, the intended parallel process is reduced to almost a linear one, in addition 
to the extra cost of communication between processors. 



Algorithm: Clustering a Binary Tree 

Input: T - the binary tree with the root r, N - the number of processors provided 

1. Create two empty queues S and S_1 (S is the working queue, S_1 is the temporary queue); 

2. S ← {r}, counter m ← 1; 

3. While m < N and queue S is not empty, do 

   3.1 Select the first element v in S and let S ← S \ {v}; 

   3.2 If v has no children, Then S_1 ← S_1 ∪ {v}; m = m+1; 

       Else 

       3.2.1 If v has one child, Then 
             p ← v 
             While p has one child, do p ← the child of p 
             Let c_1 and c_2 be the children of node p; 
             If |comb(T_c1) - comb(T_c2)| < δ (δ is a threshold saying that both branches 
                have almost the same workload) 
             Then 
                Disconnect T_c2 from T_p; 
                comb(T_p) = comb(T_p) - comb(T_c2), comb(T_v) = comb(T_v) - comb(T_c2); 
                S ← S ∪ {v, c_2}, m = m+1; 
             Else 
                Let w be the root of p’s bigger child subtree; 
                Disconnect T_w from T_p, comb(T_v) = comb(T_v) - comb(T_w); 
                comb(T_p) = comb(T_p) - comb(T_w); 
                While |comb(T_v) - comb(T_w)| > δ, do 
                   Reconnect T_w to T_p; 
                   comb(T_p) = comb(T_p) + comb(T_w); 
                   comb(T_v) = comb(T_v) + comb(T_w); 
                   p ← w. 
                   Let w be the root of p’s bigger child subtree; 
                   Disconnect T_w from T_p; 
                   comb(T_p) = comb(T_p) - comb(T_w); 
                   comb(T_v) = comb(T_v) - comb(T_w); 
                S ← S ∪ {v, w}, m = m+1; 

       3.2.2 Else 
             Let c_1 and c_2 be the children of node v; 
             If |comb(T_c1) - comb(T_c2)| < δ, Then 
                Disconnect T_c2 from T_v; 
                comb(T_v) = comb(T_v) - comb(T_c2); 
                S ← S ∪ {v, c_2}, m = m+1; 
             Else 
                Let w be the root of v’s bigger child subtree; 
                Disconnect T_w from T_v, comb(T_v) = comb(T_v) - comb(T_w); 
                While |comb(T_v) - comb(T_w)| > δ, do 
                   If v was w’s parent node, Then 
                      Reconnect T_w to T_v, comb(T_v) = comb(T_v) + comb(T_w); 
                   Else 
                      p ← w’s parent node, reconnect T_w to T_p; 
                      comb(T_p) = comb(T_p) + comb(T_w); 
                      comb(T_v) = comb(T_v) + comb(T_w); 
                   p ← w. 
                   Let w be the root of p’s bigger child subtree; 
                   Disconnect T_w from T_p; 
                   comb(T_p) = comb(T_p) - comb(T_w), comb(T_v) = comb(T_v) - comb(T_w); 
                S ← S ∪ {v, w}, m = m+1; 

4. S ← S ∪ S_1. 

5. Each element of S leads a cluster; assign each cluster to a processor. 
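
The idea behind the clustering algorithm can be sketched in Python as below. This is a deliberately simplified variant, not a transcription of the procedure above: instead of repeatedly disconnecting and reconnecting sub-trees, it walks down the heavier branch while the candidate sub-tree is still more than δ heavier than the rest of the cluster, and it counts clusters directly rather than through the counter m. The `Node` class, the functions `split` and `cluster`, and the example tree are assumptions for illustration.

```python
from collections import deque

class Node:
    def __init__(self, name, comb, children=()):
        self.name = name
        self.comb = comb                  # combinations carried out in this node
        self.children = list(children)
        self.parent = None
        for c in self.children:
            c.parent = self
        # Workload of the whole sub-tree rooted here.
        self.comb_tree = comb + sum(c.comb_tree for c in self.children)

def split(v, delta):
    """Detach from the cluster rooted at v one sub-tree of roughly half its workload."""
    total = v.comb_tree
    w = max(v.children, key=lambda c: c.comb_tree)
    # Descend along the heavier branch while w is still too heavy compared to the rest.
    while w.children and w.comb_tree - (total - w.comb_tree) > delta:
        w = max(w.children, key=lambda c: c.comb_tree)
    p = w.parent
    p.children.remove(w)
    w.parent = None
    while p is not None:                  # update workloads on the path back up to v
        p.comb_tree -= w.comb_tree
        if p is v:
            break
        p = p.parent
    return w

def cluster(root, n_processors, delta):
    """Return the roots of at most n_processors clusters."""
    work, done = deque([root]), []
    while work and len(work) + len(done) < n_processors:
        v = work.popleft()
        if not v.children:                # a single node cannot be split further
            done.append(v)
            continue
        w = split(v, delta)
        work.extend([v, w])
    return list(work) + done

# Hypothetical binary tree with the workloads used by the binarization algorithm.
leaves = {n: Node(n, 1) for n in "abcd"}
x = Node("x", 3, [leaves["a"], leaves["b"]])
y = Node("y", 3, [leaves["c"], leaves["d"]])
r = Node("r", 3, [x, y])
clusters = cluster(r, 4, delta=3)
```

Each returned node leads one cluster, and the workloads of the clusters always add up to the workload of the original tree.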



4 Shenoy’s Binary Trees 

Shenoy described an approach to constructing a binary join tree in [12] where each 
node accumulates at most two pieces of evidence. This approach was further devel- 
oped based on a one-step lookahead heuristic search to construct binary trees [13]. 

Approach in [12]: For a given qualitative Markov tree (that could be derived from 
a valuation network), a designated node is chosen as the root. Starting from the root, 
the total number of belief functions that are to be combined in the root is counted, in 
order to obtain the marginal for the root. If the root has n children, each sending it a 
belief function in addition to its own, there will be (n+1)-1 = n combinations. When 
n>2, multiple replicates (n-1 replicates) of the root are created. These multiple copies 
and the children of the root are re-organized so that there is only one combination in 
each replicate node and in the root. For each tree in the forest obtained by ignoring 
the root and all its multiple copies (and the links from them), repeat the above 
procedure until every node has only one combination. It should be pointed out that this 
one combination in a node only contributes to the final marginal of the joint for the 
root. That is, this one combination happens in Phase I in Sect. 2. If the marginals for 
some non-root nodes are required, there will be at least one more (and at most two more) 
combinations in such a node when messages are propagated down the tree, depending on 
whether the node has one child or two. The total number of combinations in each 
node is shown in Fig. 5, if we assume that a marginal for every node is of interest. 

One-Step Lookahead Heuristic Approach in [13]: Given a qualitative Markov 
tree, to compute the marginal for a node, alternative sequences of combinations of 
belief functions, with different amounts of computation, can be carried out. The 
heuristic in the one-step lookahead approach schedules combinations by sequencing the 
deletion of variables from a qualitative Markov tree. The variable to be deleted next is 
the one that leads to a combination over the smallest set of configurations [13]. 
Starting from the nodes containing variables that should be deleted first, this method 
constructs a binary tree with these nodes as initial leaves, and builds up the binary 
tree as more variables are deleted. Let Q be the set containing all the nodes in a Markov 
tree (network); we summarize Shenoy’s algorithm as follows. 
Fig. 5. A binary join tree created using Shenoy’s first method based on the original Markov 
tree in Fig. 2. Bold-faced nodes are added ones. The multiple replicates of a node are numbered 
following the name of the original node 

Step 1. Selecting the variable(s) that should be deleted first, a subset O of Q is 
formed with each element in O containing this (or these) variable(s). Let Q = Q \ O. 

Step 2. A pair of elements in O is chosen such that the union of the pair contains 
the minimum number of variables among all possible unions of pairs in O. The 
elements in the pair become two leaves and are removed from O. The union of the pair is 
created and inserted into O. This new node acts as the parent of the two leaves. 

Step 3. The above step is repeated until O has only one node, m, left. Create a new 
node, n, containing the variables remaining after deleting the chosen variable(s) from 
m, and let n be the parent of m. Let Q = Q ∪ {n}. 

Step 4. Repeat Steps 1 to 3 for the next chosen variable(s) to be deleted, 
until Q has one element left, which will be the root of the created binary tree. 
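
Steps 1 to 4 can be sketched in Python as follows. The sketch takes the deletion sequence as given (the heuristic that chooses it is not implemented), and it makes one assumption that the summary above does not spell out: nodes still left in Q after the last deletion are merged pairwise in the same way as in Step 2. The class `TNode`, the function names and the three-node example are hypothetical.

```python
from itertools import combinations

class TNode:
    def __init__(self, variables, children=()):
        self.variables = frozenset(variables)
        self.children = list(children)

def merge_pairwise(group):
    """Step 2, repeated: join the pair with the smallest union until one node is left."""
    group = list(group)
    while len(group) > 1:
        a, b = min(combinations(group, 2),
                   key=lambda pair: len(pair[0].variables | pair[1].variables))
        group.remove(a)
        group.remove(b)
        group.append(TNode(a.variables | b.variables, [a, b]))
    return group[0]

def build_binary_tree(subsets, deletion_order):
    """Build a binary tree from the original nodes (variable subsets); subsets must be non-empty."""
    pool = [TNode(s) for s in subsets]                       # the set Q
    for var in deletion_order:
        group = [n for n in pool if var in n.variables]      # Step 1: the subset O
        if not group:
            continue
        pool = [n for n in pool if var not in n.variables]   # Q = Q \ O
        m = merge_pairwise(group)                            # Steps 2 and 3
        pool.append(TNode(m.variables - {var}, [m]))         # new node n, parent of m
    # Assumption of this sketch: leftover nodes are merged like the elements of O.
    return merge_pairwise(pool)

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)

# Hypothetical input: three original nodes, deleting A and then B.
root = build_binary_tree([{"A", "B"}, {"B", "C"}, {"C"}], ["A", "B"])
```

Even in this small example, three original nodes lead to a tree of seven nodes, which illustrates why this construction adds many nodes.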

When using Shenoy’s second approach to transform the qualitative Markov tree 
in Fig. 2, we assume that each leaf node contains one variable, such as e={E}, and 
every non-leaf node contains the collection of variables in the leaves below it, such as 
b={E,F}. Based on the deleting sequence {E}, {F}, {K}, {L}, {M}, {H}, {I}, {N}, {O}, 
and with {D} as the final remaining variable, the binary tree is built as shown in 
Fig. 6. All the leaf nodes are the initial subsets of variables with original belief 
functions. All the other nodes are inserted later in order either to merge two subsets 
or to delete one or more variables from a merged node, with the latter being denoted 
in bold-faced font. If the final marginal is required for every original subset, the 
total number of combinations in each node is shown in Fig. 6. In summary, the total 
numbers of combinations in the original Markov tree (Fig. 2), in the binary tree in 
Fig. 3, and in the binary trees in Fig. 5 and Fig. 6 are 64, 53, 50 and 55 respectively. 
5 Comparison among These Three Types of Binary Trees and 
Conclusion 

Where Does Binary Occur? The binarization in our structure means that each node 
in the final tree should have no more than two child nodes. If a node has its initial 
belief function in addition to those from its two children, then there will be three 
belief functions requiring two combinations. However, this is not allowed in Shenoy’s 
approaches, since in his structures there can only be one combination in each node. 
Therefore, a replicate of this node will separate the node from its two children. For 
example, node a1, a replicate of a, separates nodes a2 (another replicate of a) and D 
in Fig. 5. Such a replicate node does not exist in our binary tree. 

Structural Differences. In Shenoy’s first approach, the number of nodes being 
added depends on how initial belief functions are assigned. For instance, if a does not 
have its initial belief function, then there will only be one replicate of a, since there is 
only one combination. Therefore, the structure (in terms of total number of added 
nodes) of a binary tree changes when different sets of belief functions are given ini- 
tially. On the contrary, in our algorithm, the number of nodes being added is fixed. 
The only difference that alternative sets of initial belief functions make is the ar- 
rangement of sub-branches, due to the change of calculations in different branches. 

Computational Cost. In terms of computational cost, our algorithm requires some 
more combinations than Shenoy’s first algorithm. The extra combinations occur when a 
node has a belief function from its parent and one of its own. These two belief 
functions are combined twice, each time before being combined with the belief function 
from one of its children for the purpose of sending the combined belief function to the 
other child. 

Why Are There So Many Added Nodes? The major step in Shenoy’s second 
approach is to arrange leaves (all the nodes in the original Markov tree) based on the 
deletion sequence, and to construct sub-trees. Each sub-tree results in at least two 
newly added nodes, one is the union of variables in the two nodes being merged and 
another (the root of the sub-tree) contains variables after deleting selected variables 
from the former. All the non-leaf nodes are added ones. In Fig. 6, there are 23 added 
nodes in comparison to 4 and 9 added nodes in Fig. 3 and Fig. 5 respectively. More- 
over, these added nodes cause more calculations on marginals. For example, when 
propagating up the tree, each bold-faced node requires its marginal from the node 
below it after deleting a variable. A marginal is again required the other way round. 
Therefore, this method is expensive in both the total number of combinations and in 
preparation for combinations. 

Conclusion. In this paper, we proposed a computational-workload-based algorithm 
to transform a qualitative Markov tree into a binary tree and an algorithm for 
partitioning the tree into clusters for parallel processing. We also compared our binary 
tree with those constructed by Shenoy’s two algorithms in [12] and [13] respectively. 
The study shows that Shenoy’s first algorithm requires the smallest number of 
combinations and is the most efficient one if only one processor is used. Our binary 
trees require some more combinations than Shenoy’s first type of trees, but should 
perform well when multiple processors are available. Shenoy’s second type of trees is 
expensive in both single- and multiple-processor environments due to the large number of 
added nodes and additional calculations on marginals for added nodes. 
Abstract. This paper presents an original method for risk assessment 
in water treatment, based on belief functions. The risk of producing 
non-compliant drinking water (i.e., such that one of the quality parameters 
exceeds the regulation standards) is estimated taking into account 
the quality parameters of raw water and the process line of the treatment 
plant (technology, different failure modes and corresponding failure 
rates). Uncertainty on available data (treatment steps efficiency, failure 
rates, times to repair and raw water quality) is modeled using belief 
functions that are combined to compute a degree of confidence that the 
produced water will meet quality standards. The methodology recovers 
the classical results (obtained by fault tree analysis) as a limit case 
when uncertainties on input data are modeled by probabilities, and still 
provides informative results when only weaker forms of knowledge are 
available. 



1 Problem Description 

The production and distribution of good quality water to consumers is a fun- 
damental issue, given the medical and financial consequences that could result 
from the delivery of insufficient quality water. It is therefore necessary to set up 
a treatment process adapted to the raw water to be treated, and to estimate the 
residual risk of producing water that does not comply with the regulation (in 
most cases the problem is detected by real time monitoring and the production 
is stopped, but this induces heavy financial penalties). 

In the classical approach, this risk assessment process is performed by deter- 
mining the probability of the undesirable event “production of non-compliant 
water”, taking into account the quality of the resource to be treated (i.e., the es- 
timated probability to find a given concentration of an undesirable component), 
characteristics of the treatment unit (efficiency of the treatment steps), as well 
as different failure modes that can occur in the process line (failure rates and 
repair times of each mode). Such a process was developed and described in detail 
in previous papers [5,1]. 
One major difficulty in applying risk assessment methods in the environmental 
engineering domain is that basic data are not perfectly known and are often 
determined by expert judgement with a high level of uncertainty. In this paper, it 
is proposed to model the uncertainty on raw water quality, process line efficiency 
and state of the treatment plant (nominal or failure mode) in the belief function 
framework. Each source of information will be modeled by a belief function and 
combined to obtain an assessment of the plausibility to produce non compliant 
water. 

In the limit case where the basic belief functions are probabilities, the pro- 
posed methodology recovers the results obtained by the classical approach: in 
that case, the belief function obtained for the variable representing the non 
compliance of produced water is also a probability measure. In the general case, 
however, the belief function obtained is no longer a probability and it is possible 
to compute the belief, the plausibility and the pignistic probability to produce 
non compliant water. These results can be used to estimate a level of confidence 
to meet contractual requirements with a given treatment plant technology, and 
therefore to help treatment plant designers to choose the optimal architecture, 
given an objective level of residual risk. 

The rest of the paper is organized as follows. The classical methodology is 
first recalled in Sect. 2. Our approach is then described in Sect. 3, and compared 
to the classical approach in Sect. 4. Simulations are presented in Sect. 5, and 
Sect. 6 concludes the paper. 

2 Classical Approach 

The following is just a brief reminder of the major concepts described in [5,1]. 
The current regulation on potable water takes into account 62 quality parameters 
of various types (turbidity, colour, concentration of mineral or organic compo- 
nents, physicochemical properties...). To meet these requirements, the treatment 
process must be adapted, on the one hand, to the general quality of water, and 
on the other hand to exceptional pollution peaks. To take into account this resource 
variability, a treatment plant is composed of a succession of various 
treatment processes (preoxidation, clarification, polishing, disinfection...). 

For each quality parameter, the efficiency of each treatment step is represented
by a transfer function, giving the output concentration $C_{out}$ as a function
of the input concentration $C_{in}$ for the considered parameter. In most cases, this
transfer function is linear and can be expressed using a single parameter $\alpha$, called
the abatement rate or reduction factor: $C_{out} = (1 - \alpha) C_{in}$. It is also possible
to account for nonlinear transfer functions by defining different abatement rates
according to the input concentration, but we will not consider that case in this
paper.

By combining these local transfer functions (established for each treatment 
step), it is possible to define a global transfer function, which represents the 
global efficiency of the treatment line for a given parameter. This global transfer
function must be determined for the nominal mode of the treatment plant
(abatement rate $\alpha_0$) and also for all possible failure modes. The “Failure Modes
Effects and Criticality Analysis” (FMECA) methodology makes it possible to determine,
for each of the $n$ possible failure modes, the corresponding degraded abatement
rate $\alpha_i$, the failure rate $\lambda_i$ and the repair time $T_i$. The probability of being in failure
mode $i$ is then

$$p_i = \lambda_i T_i, \tag{1}$$

and the probability of being in the nominal state is given by

$$p_0 = 1 - \sum_{i=1}^{n} p_i. \tag{2}$$

Fig. 1. Transfer function (output concentration of an undesirable water characteristic,
as a function of the input concentration) in nominal mode and for $n$ degraded modes
of the treatment plant. In this linear approximation, the transfer function for mode $i$
only depends on an abatement rate $\alpha_i$, such that $C_{out} = (1 - \alpha_i) C_{in}$

Based on the previously defined transfer functions, it is possible to define acceptable
water quality thresholds for raw water in each operating mode (nominal and
degraded) of the treatment plant, by inverting all global transfer functions and
by applying these inverse functions to the normalized threshold $N$ imposed for
produced water. This step is illustrated in Fig. 1: $n + 1$ thresholds concerning
raw water are obtained as $\eta_i = N/(1 - \alpha_i)$ $(0 \le i \le n)$, and $n + 2$ possible
raw water states are defined as $e_i = [\eta_i, \eta_{i-1})$ $(0 \le i \le n + 1)$, with $\eta_{n+1} = 0$
and $\eta_{-1} = \infty$. Two possible states for the produced water are also defined: $s_0$
(corresponding to non compliance with the norm: $C_{out} > N$) and $s_1$ ($C_{out} \le N$).

The last step of the classical methodology is performed via Fault Tree Analysis
(FTA), as illustrated in Fig. 2. For the considered parameter, the top level event
of the fault tree is “Produced water in state $s_0$ (non compliant with norm $N$)”.

Fig. 2. Fault tree analysis of event “Production of water in state $s_0$” (non compliant
with the norm)



The first level of the fault tree is a decomposition between the different possible
modes (nominal and failure modes) of the treatment plant by the top level OR
gate. The plant being in a given mode $i$ will not accept input concentrations
exceeding $\eta_i$ for the considered parameter: second level AND gate. The raw
water states whose concentration is more than $\eta_i$ are states $e_j$ $(0 \le j \le i)$: third
level OR gate. The minimal cutsets are easily obtained and the probability of non
compliance (unavailability concerning the considered parameter) is given by:

$$p(s_0) = \sum_{i=0}^{n} \sum_{j=0}^{i} p_i q_j, \tag{3}$$

where $q_j$ is the probability that the raw water is in state $e_j$. The global unavailability
is simply obtained by adding the contribution of each quality parameter.
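The computation in (1)-(3) can be sketched in a few lines of Python. This is only an illustration, not the authors' tool: the function name `classical_unavailability`, its argument layout and the numerical values in the usage note are assumptions made for the example, and the modes are assumed to be indexed by decreasing abatement rate so that the raw water thresholds decrease with $i$.

```python
def classical_unavailability(alpha, lam, T, q, N=1.0):
    """Classical fault-tree computation of Sect. 2 (illustrative sketch).

    alpha : abatement rates [alpha_0, ..., alpha_n], index 0 = nominal mode
    lam, T: failure rates and repair times of the failure modes 1..n
    q     : probabilities of the raw water states e_0, ..., e_{n+1}
    N     : norm imposed on produced water
    """
    n = len(alpha) - 1
    p = [0.0] + [lam[i] * T[i] for i in range(n)]  # Eq. (1)
    p[0] = 1.0 - sum(p[1:])                        # Eq. (2)
    eta = [N / (1.0 - a) for a in alpha]           # raw water thresholds
    # Eq. (3): in mode i, raw water states e_0..e_i give non compliant water
    p_s0 = sum(p[i] * q[j] for i in range(n + 1) for j in range(i + 1))
    return p, eta, p_s0
```

With the hypothetical inputs `classical_unavailability([0.8, 0.4], [2e-3], [96.0], [0.05, 0.15, 0.80])`, the mode probabilities are 0.808 and 0.192, and the unavailability is 0.808 × 0.05 + 0.192 × 0.20 = 0.0788.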



3 Belief Function Approach 

3.1 Rationale 

The classical solution described above assumes the availability of precise and 
complete prior knowledge of transfer functions, failure rates and repair times, 
as well as enough historical data to estimate the distribution of water quality 
parameters. However, such knowledge and data are usually not available, in particular
in the case of calls for bids, where a proposal must be submitted based on
partial information. Moreover, transfer functions, repair times and failure rates
can only be obtained from laboratory tests, expert knowledge or feedback from
operational sites, which generally does not yield reliable estimates for
a specific site. That is why an approach integrating these various uncertainties
was developed. The belief function framework was chosen because of its flexibility
for representing weak forms of knowledge [7], and because it generalizes
Probability Theory, making it possible to recover the classical results when all required
data are available.



3.2 Notations and Background 

The interpretation of belief functions adopted in this paper is that of Smets’ 
Transferable Belief Model (TBM) [9] . In this model, a belief function is under- 
stood as representing an agent’s state of belief, without resorting to an underly- 
ing probability model. Only the essential definitions and specific notations will 
be given here. A detailed exposition of the TBM may be found in [9] . 

A basic belief assignment (bba) on domain (or frame of discernment) $X$ is
noted $m^X$ (for convenience, we use the same notation $X$ for a variable and
its domain). It is defined as a function from the powerset $2^X$ of $X$ to $[0,1]$
verifying $\sum_{A \subseteq X} m^X(A) = 1$. The corresponding belief and plausibility functions
are defined, respectively, as:

$$bel^X(A) = \sum_{\emptyset \neq B \subseteq A} m^X(B), \tag{4}$$

$$pl^X(A) = \sum_{B \cap A \neq \emptyset} m^X(B). \tag{5}$$
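As a small illustration of (4) and (5), a bba on a finite frame can be stored as a dictionary mapping focal sets to masses. The representation (Python `frozenset` keys) and the function names `bel` and `pl` are choices made for this sketch, not notation from the paper.

```python
def bel(m, A):
    """Belief of A: total mass of the non-empty focal sets included in A."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B and B <= A)


def pl(m, A):
    """Plausibility of A: total mass of the focal sets intersecting A."""
    A = frozenset(A)
    return sum(v for B, v in m.items() if B & A)
```

For a bba with masses 0.5 on {a}, 0.3 on {a, b} and 0.2 on {a, b, c}, the belief of {a, b} is 0.8 while the plausibility of {c} is 0.2.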

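Given a bba $m^{X \times Y}$ defined on the Cartesian product of two domains $X$ and $Y$,
the marginal bba $m^{X \times Y \downarrow X}$ on $X$ is defined, for all $A \subseteq X$, as

$$m^{X \times Y \downarrow X}(A) = \sum_{\{B \subseteq X \times Y \,\mid\, \mathrm{Proj}(B \downarrow X) = A\}} m^{X \times Y}(B), \tag{6}$$

where $\mathrm{Proj}(B \downarrow X)$ denotes the projection of $B$ onto $X$, defined as

$$\mathrm{Proj}(B \downarrow X) = \{x \in X \mid \exists y \in Y, (x, y) \in B\}. \tag{7}$$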



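Conversely, let $m^X$ be a bba on $X$. Its vacuous extension on $X \times Y$ is defined
as:

$$m^{X \uparrow X \times Y}(B) = \begin{cases} m^X(A) & \text{if } B = A \times Y \text{ for some } A \subseteq X, \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$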
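Marginalization (6)-(7) and vacuous extension (8) are easy to express when a bba on a product space is stored as a dictionary whose keys are frozensets of pairs. The sketch below uses that assumed representation; the names `marginalize` and `vacuous_extension` are illustrative only.

```python
from itertools import product


def marginalize(m_xy):
    """Eq. (6)-(7): project every focal set of pairs (x, y) onto X."""
    out = {}
    for B, v in m_xy.items():
        A = frozenset(x for x, _ in B)
        out[A] = out.get(A, 0.0) + v
    return out


def vacuous_extension(m_x, Y):
    """Eq. (8): each focal set A of m^X becomes the cylinder A x Y."""
    return {frozenset(product(A, Y)): v for A, v in m_x.items()}
```

Marginalizing a vacuous extension gives back the original bba, which is a quick sanity check of both functions.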



Another useful notion is that of ballooning extension [8]. Let $m^X[y]$ denote
the conditional bba on $X$, given that $Y = y$. The ballooning extension of $m^X[y]$
on $X \times Y$ is the least committed bba whose conditioning on $y$ yields $m^X[y]$ (see
[8] for detailed justification). It is obtained for all $B \subseteq X \times Y$ as:

$$m^X[y]^{\Uparrow X \times Y}(B) = \begin{cases} m^X[y](A) & \text{if } B = (A \times \{y\}) \cup (X \times (Y \setminus \{y\})) \text{ for some } A \subseteq X, \\ 0 & \text{otherwise.} \end{cases} \tag{9}$$

Let us now consider two bba’s $m_1^X$ and $m_2^X$ induced by two distinct sources
of information. If both sources are known to be reliable, they can be combined
using the (unnormalized) Dempster’s rule of combination, leading to a new bba
$m_1^X \cap m_2^X$ defined as:

$$(m_1^X \cap m_2^X)(A) = \sum_{B \cap C = A} m_1^X(B)\, m_2^X(C). \tag{10}$$
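The ballooning extension (9) and the unnormalized rule (10) can be sketched with the same dictionary-of-frozensets representation of a bba. The function names and the representation are assumptions of this example.

```python
from itertools import product


def ballooning_extension(m_cond, y, X, Y):
    """Eq. (9): extend the conditional bba m^X[y] to the product X x Y."""
    rest = frozenset(product(X, set(Y) - {y}))  # X x (Y \ {y})
    return {frozenset((x, y) for x in A) | rest: v for A, v in m_cond.items()}


def conjunctive(m1, m2):
    """Eq. (10): unnormalized Dempster's rule; conflict stays on the empty set."""
    out = {}
    for B, v in m1.items():
        for C, w in m2.items():
            out[B & C] = out.get(B & C, 0.0) + v * w
    return out
```

Because no normalization is applied, the mass obtained on the empty frozenset measures the conflict between the two sources.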



Finally, the TBM is based on a two-level mental model: the credal level, where
beliefs are entertained and represented by belief functions, and the pignistic level,
where decisions are made. The pignistic transformation maps a bba $m^X$ to a
probability measure $BetP^X$ on $X$, defined as:

$$BetP^X(A) = \sum_{\emptyset \neq B \subseteq X} \frac{m^X(B)}{1 - m^X(\emptyset)} \frac{|A \cap B|}{|B|}, \quad \forall A \subseteq X. \tag{11}$$
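A minimal sketch of the pignistic transformation (11), restricted to singletons (the probability of any other event follows by additivity). The name `betp` and the dictionary representation are assumptions of the example; the function raises a division error if all the mass lies on the empty set.

```python
def betp(m, X):
    """Eq. (11) on singletons: pignistic probability of each element of X."""
    norm = 1.0 - m.get(frozenset(), 0.0)
    # The empty set never contains x, so it is skipped and |B| is never zero.
    return {x: sum(v / len(B) for B, v in m.items() if x in B) / norm for x in X}
```

For masses 0.3 on the empty set, 0.3 on {a}, 0.2 on {b} and 0.2 on {a, b}, the pignistic probabilities of a and b are 4/7 and 3/7.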



3.3 Application 

Available Data. Let us go back to the problem presented in Sect. 2, but let
us now assume that, for each failure mode $x_i$ ($i = 1, \ldots, n$), the abatement
rate $\alpha_i$, the failure rate $\lambda_i$ and the repair time $T_i$ are not known precisely. Let
$\tilde{\alpha}_i = (\alpha_i^-, \alpha_i^0, \alpha_i^+)$ be a triangular fuzzy number defining a flexible constraint on
$\alpha_i$, and let $[\lambda_i^-, \lambda_i^+]$ and $[T_i^-, T_i^+]$ denote interval-valued assessments of $\lambda_i$ and
$T_i$, respectively. Furthermore, the probability distribution $(q_j)$ of input concentrations
is no longer assumed to be known. Instead, we more realistically assume
that a finite sample $C_{in,1}, \ldots, C_{in,K}$ has been observed.

Discretization of Input and Output Spaces. To translate this information in the
TBM framework, we first have to define the underlying variables expressed on
suitable finite domains. In the classical approach, the output concentration was
discretized in two categories: water either meets or does not meet a fixed quality
limit. However, this approach is too restrictive and results in a loss of information.
To refine this discretization, we now define $\ell + 1$ thresholds $\nu_k$, $k = 0, \ldots, \ell$,
which induce $\ell + 2$ possible states $s_k$, $k = 0, \ldots, \ell + 1$, for the output water (see
Fig. 3). In practical cases, the tool can manage values of $\ell$ up to about 10, which is
sufficiently accurate for real cases. For a given mode $x_i$ of the treatment plant,
the output threshold $\nu_k$ defines an input threshold $\theta_{k,i}$ (the input concentration
must be less than $\theta_{k,i}$ for the output concentration to be less than $\nu_k$ when the
treatment plant is in mode $x_i$). In order to recover the classical limit, one of
the output thresholds must be the norm ($N = \nu_k$ for some $k$) and the input
thresholds must at least contain the $n + 1$ values $\theta_{k,i}$ obtained with that $k$ and
all functioning modes $i$ of the treatment plant. However, the discretization can
take into account more values than this minimal set. We note $\eta_j$ ($0 \le j \le m$)
the input thresholds arranged in decreasing order and $e_j$ ($0 \le j \le m + 1$) the
corresponding input states. Note that we define $\ell + 1$ thresholds for each of the
$n + 1$ modes, so that $m + 1 \le (n + 1)(\ell + 1)$ (the upper bound may not be reached
because some of the thresholds may be equal).

Fig. 3. Discretization of input and output concentrations, and fuzzy transfer function



Frames of Discernment. We thus have three underlying variables: the discretized
input concentration taking values in $E = \{e_0, \ldots, e_{m+1}\}$, the discretized output
concentration taking values in $S = \{s_0, \ldots, s_{\ell+1}\}$, and the plant state in
$X = \{x_0, \ldots, x_n\}$. We now have to translate the available pieces of information
into belief functions on the joint space $X \times E \times S$, combine this evidence,
and marginalize on $S$ to obtain our belief concerning the output concentration
values.



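Representation of Transfer Functions. The fuzzy abatement rate $\tilde{\alpha}_i$ for operating
mode $i$ may be seen as defining a fuzzy relation between input and output
concentrations. This fuzzy relation may be expressed as a possibility distribution
$\pi_i$ on variables $C_{in}$ and $C_{out}$ defined as a function of the ratio $\rho = C_{out}/C_{in}$ as:

$$\pi_i(C_{in}, C_{out}) = \begin{cases} 0 & \text{if } \rho < \alpha_i^- \text{ or } \rho > \alpha_i^+, \\ \dfrac{\rho - \alpha_i^-}{\alpha_i^0 - \alpha_i^-} & \text{if } \alpha_i^- \le \rho \le \alpha_i^0, \\ \dfrac{\alpha_i^+ - \rho}{\alpha_i^+ - \alpha_i^0} & \text{if } \alpha_i^0 < \rho \le \alpha_i^+. \end{cases} \tag{12}$$

After discretization of input and output concentrations, $\pi_i$ induces a possibility
distribution on the product space $E \times S$ (see Fig. 3), defined as:

$$\pi_i(e_j, s_k) = \sup_{C_{in} \in e_j,\, C_{out} \in s_k} \pi_i(C_{in}, C_{out}). \tag{13}$$

Such a possibility distribution is known to be equivalent to a consonant bba
$m^{E \times S}[x_i]$ (see [2]).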
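The equivalence between a possibility distribution on a finite set and a consonant bba can be made concrete as follows. This sketch uses the standard construction by nested level cuts; the function name `consonant_bba` is hypothetical, and a subnormal distribution (maximum below 1) is handled by putting the missing mass on the empty set, which is an assumption in line with the unnormalized setting of the TBM.

```python
def consonant_bba(pi):
    """Turn a possibility distribution {element: degree} into a consonant bba.

    The focal sets are the nested level cuts {w : pi(w) >= level}; each one
    receives the difference between its level and the next lower level.
    """
    levels = sorted({v for v in pi.values() if v > 0}, reverse=True)
    top = levels[0] if levels else 0.0
    m = {}
    if top < 1.0:
        m[frozenset()] = 1.0 - top  # subnormal case (assumption, see lead-in)
    for cur, nxt in zip(levels, levels[1:] + [0.0]):
        m[frozenset(w for w, v in pi.items() if v >= cur)] = cur - nxt
    return m
```

By construction, the plausibility of each singleton computed from the resulting bba equals its possibility degree.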
Belief on X. In the classical case, knowledge of failure rates $\lambda_i$ and repair times
$T_i$ induced a probability function on $X$ using (1) and (2). Since $\lambda_i$ and $T_i$ are
now only known to lie in given intervals, we have a family $\mathcal{P}^X$ of probability
distributions on $X$, defined by the constraints $p_i^- \le p_i \le p_i^+$ ($i = 0, \ldots, n$),
with $p_i^- = \lambda_i^- T_i^-$, $p_i^+ = \lambda_i^+ T_i^+$ ($i = 1, \ldots, n$), $p_0^- = 1 - \sum_{i=1}^{n} p_i^+$ and
$p_0^+ = 1 - \sum_{i=1}^{n} p_i^-$. The lower and upper probabilities of an event $A \subseteq X$ are
given by:

$$P^-(A) = \max\left(\sum_{x_i \in A} p_i^-,\ 1 - \sum_{x_i \notin A} p_i^+\right), \tag{14}$$

$$P^+(A) = \min\left(\sum_{x_i \in A} p_i^+,\ 1 - \sum_{x_i \notin A} p_i^-\right). \tag{15}$$

These lower and upper probabilities do not, in general, verify the axioms of
belief and plausibility measures. However, we may represent this information in
the belief function framework by the most specific bba $m^X$ (according, e.g., to
the nonspecificity uncertainty measure [4]), whose set of compatible probability
functions includes $\mathcal{P}^X$. This may be obtained by solving the following linear
program:

$$\min_{m^X} \sum_{\emptyset \neq A \subseteq X} m^X(A) \log |A|$$

under the constraints:

$$bel^X(A) \le P^-(A) \le P^+(A) \le pl^X(A), \quad \forall A \subseteq X.$$
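The interval bounds on the mode probabilities and the resulting lower and upper probabilities of an event can be computed directly. The sketch below follows the usual formulas for probability intervals; the function names and the list-based indexing of the modes are assumptions of the example, and the linear program itself is not solved here.

```python
def mode_intervals(lam_lo, lam_hi, T_lo, T_hi):
    """Bounds on p_0, ..., p_n from interval-valued failure rates and repair times."""
    lo = [l * t for l, t in zip(lam_lo, T_lo)]
    hi = [l * t for l, t in zip(lam_hi, T_hi)]
    # Nominal mode: lowest when all failure modes are most likely, and vice versa.
    return [1.0 - sum(hi)] + lo, [1.0 - sum(lo)] + hi


def lower_upper(A, p_lo, p_hi):
    """Lower and upper probability of the event A (a set of mode indices)."""
    idx = range(len(p_lo))
    lower = max(sum(p_lo[i] for i in A),
                1.0 - sum(p_hi[i] for i in idx if i not in A))
    upper = min(sum(p_hi[i] for i in A),
                1.0 - sum(p_lo[i] for i in idx if i not in A))
    return lower, upper
```

With a single failure mode whose rate lies in [0.001, 0.002] per hour and whose repair time lies in [96, 192] hours, the failure mode probability is bounded by 0.096 and 0.384, and the nominal mode probability by 0.616 and 0.904.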

Belief on E. The available information on $E$ is of a different nature: it consists of
a finite sample $C_{in,1}, \ldots, C_{in,K}$ of observed values. A simple approach to build a
belief function on $E$ might be to consider the histogram, i.e. to define $m^E(\{e_j\})$ as
the relative frequency of observations falling in class $e_j$. This approach, however,
is not satisfactory in the small sample case because it does not take into account
the sample size. Ideally, the inferred belief function should reflect the amount
of available information, and hence the sample size. One way to achieve this
goal is to generate $B$ bootstrap replicates of the data [3]. Let $p_b(e_j)$ be the
relative frequency of class $e_j$ in bootstrap sample $b$, and define $p^-(e_j)$ and $p^+(e_j)$
as, say, the 1st and 9th deciles of the distribution of $p_b(e_j)$, $b = 1, \ldots, B$. We
then obtain lower and upper probabilities on $E$, which define a family $\mathcal{P}^E$ of
probability distributions. As before, we may translate this information in the
belief function format by considering the most specific bba $m^E$ whose set of
compatible probability distributions includes $\mathcal{P}^E$. However, as $E$ may be much
larger than $X$, the complexity of this solution might become too high. A simpler
approach is to restrict the number of focal elements of $m^E$. For instance, if $m^E$
is constrained to be quasi-Bayesian (i.e., to have only singletons and $E$ as focal
elements), the solution can be shown to be:

$$m^E(\{e_j\}) = p^{*-}(e_j), \quad j = 0, \ldots, m + 1, \tag{16}$$

$$m^E(E) = \max_j \left(p^{*+}(e_j) - p^{*-}(e_j)\right), \tag{17}$$

$$m^E(A) = 0, \quad \forall A \subset E,\ A \neq E,\ |A| \neq 1, \tag{18}$$

with

$$p^{*-}(e_j) = \max\left(p^-(e_j),\ 1 - \sum_{j' \neq j} p^+(e_{j'})\right), \quad j = 0, \ldots, m + 1, \tag{19}$$

$$p^{*+}(e_j) = \min\left(p^+(e_j),\ 1 - \sum_{j' \neq j} p^-(e_{j'})\right), \quad j = 0, \ldots, m + 1. \tag{20}$$
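The bootstrap step and the tightening of the bounds in (19)-(20) can be sketched as follows. The sketch assumes that the observations have already been mapped to class indices; the function names, the default number of replicates and the seed are illustrative choices, and the deciles are computed with `statistics.quantiles` from the Python standard library.

```python
import random
import statistics


def bootstrap_intervals(labels, n_classes, B=200, seed=0):
    """1st and 9th deciles of the bootstrap class frequencies.

    labels: class index (0 .. n_classes-1) of each observed concentration.
    """
    rng = random.Random(seed)
    K = len(labels)
    freqs = [[] for _ in range(n_classes)]
    for _ in range(B):
        resample = rng.choices(labels, k=K)  # draw K values with replacement
        for j in range(n_classes):
            freqs[j].append(resample.count(j) / K)
    deciles = [statistics.quantiles(f, n=10) for f in freqs]
    return [d[0] for d in deciles], [d[-1] for d in deciles]


def tighten(p_lo, p_hi):
    """Eq. (19)-(20): make each bound consistent with the bounds of the other classes."""
    s_lo, s_hi = sum(p_lo), sum(p_hi)
    new_lo = [max(l, 1.0 - (s_hi - h)) for l, h in zip(p_lo, p_hi)]
    new_hi = [min(h, 1.0 - (s_lo - l)) for l, h in zip(p_lo, p_hi)]
    return new_lo, new_hi
```

For example, the bounds [0.1, 0.2, 0.3] and [0.2, 0.3, 0.9] are tightened to [0.1, 0.2, 0.5] and [0.2, 0.3, 0.7]: the third class cannot have a probability below 0.5 or above 0.7 once the bounds of the other two classes are taken into account.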



Combination and Marginalization. The final step is to combine all the available
evidence, and marginalize on $S$. For that purpose, all belief functions must first
be extended to the product space $X \times E \times S$ using the ballooning extension for
$m^{E \times S}[x_i]$ ($i = 0, \ldots, n$), and using the vacuous extension for $m^X$ and $m^E$. The
resulting belief functions are combined using Dempster’s rule, and the result is
marginalized on $S$. Formally, the final bba on $S$ is thus defined as:

$$m^S = \left( m^{X \uparrow X \times E \times S} \cap m^{E \uparrow X \times E \times S} \cap \bigcap_{i=0}^{n} m^{E \times S}[x_i]^{\Uparrow X \times E \times S} \right)^{\downarrow S}. \tag{21}$$

Note that these operations may be performed very efficiently using local computation
algorithms such as the one described in [6].



4 Comparison with the Classical Solution 

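In this section, we will show that our method yields the same results as the
classical method when all necessary data are precisely known. When $\alpha_i$ is known,
$m^{E \times S}[x_i]$ has a unique focal element: we have

$$m^{E \times S}[x_i](B_i) = 1, \tag{22}$$

where $B_i$ is defined as $B_i = \bigcup_{k=0}^{\ell+1} A_{k,i} \times \{s_k\}$, with

$$A_{k,i} = \{e_j \in E \mid e_j \subseteq [\theta_{k,i}, \theta_{k-1,i})\}.$$

The ballooning extension of $m^{E \times S}[x_i]$ yields:

$$m^{E \times S}[x_i]^{\Uparrow X \times E \times S}\left(\{x_i\} \times B_i \cup \overline{\{x_i\}} \times E \times S\right) = 1. \tag{23}$$

It can be shown by induction that

$$\bigcap_{i=0}^{n} \left(\{x_i\} \times B_i \cup \overline{\{x_i\}} \times E \times S\right) = \bigcup_{i=0}^{n} \left(\{x_i\} \times B_i\right). \tag{24}$$

Hence, if we note $m_1^{X \times E \times S} = \bigcap_{i=0}^{n} m^{E \times S}[x_i]^{\Uparrow X \times E \times S}$, we obtain:

$$m_1^{X \times E \times S}\left(\bigcup_{i=0}^{n} \{x_i\} \times B_i\right) = 1. \tag{25}$$

In a second step, we have to express our beliefs concerning the raw water and
plant states. In the classical case, $m^X$ and $m^E$ are probability functions defined
by $m^X(\{x_i\}) = p_i$ ($i = 0, \ldots, n$) and $m^E(\{e_j\}) = q_j$ ($j = 0, \ldots, m + 1$). The
vacuous extension of these bba’s on the joint space yields:

$$m^{X \uparrow X \times E \times S}(\{x_i\} \times E \times S) = p_i, \quad i = 0, \ldots, n, \tag{26}$$

$$m^{E \uparrow X \times E \times S}(X \times \{e_j\} \times S) = q_j, \quad j = 0, \ldots, m + 1. \tag{27}$$

Let $m_2^{X \times E \times S}$ denote the combination of $m_1^{X \times E \times S}$ with $m^{X \uparrow X \times E \times S}$. We have

$$m_2^{X \times E \times S}(\{x_i\} \times B_i) = p_i, \quad i = 0, \ldots, n. \tag{28}$$

Finally, let $m_3^{X \times E \times S}$ denote the combination of $m_2^{X \times E \times S}$ with $m^{E \uparrow X \times E \times S}$. By
construction, there is only one $k$ such that $e_j \in A_{k,i}$, considering a fixed state $i$ of
the plant. Consequently, the resulting focal sets are all of the form:

$$\left(\{x_i\} \times \bigcup_{k=0}^{\ell+1} (A_{k,i} \times \{s_k\})\right) \cap (X \times \{e_j\} \times S) = \{x_i\} \times \{e_j\} \times \{s_k\} \tag{29}$$

with $e_j \in A_{k,i}$. Hence, $m_3^{X \times E \times S}$ is a probability function.

In order to conclude this demonstration and to show the equivalence between
our approach and the classical one, we now study the particular case developed
in the presentation of the classical method, which consists in considering only one
threshold for treated water, which must be the norm $N$. We then only have two
states for treated water, $s_0$ and $s_1$, and, consequently, two sets of raw water states,
$A_{0,i}$ and $A_{1,i}$. In this case, we have

$$m_3^{X \times E \times S}(A_{ij}) = p_i q_j \tag{30}$$

with

$$A_{ij} = \begin{cases} \{x_i\} \times \{e_j\} \times \{s_0\} & \text{if } j \le i, \\ \{x_i\} \times \{e_j\} \times \{s_1\} & \text{if } j > i. \end{cases} \tag{31}$$

We then finally obtain, after marginalization:

$$m^S(\{s_0\}) = \sum_{i=0}^{n} \sum_{j=0}^{i} p_i q_j, \tag{32}$$

which completes the proof.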
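The argument of this section can be checked numerically on a small example. The sketch below builds the joint space explicitly, applies the ballooning and vacuous extensions, combines everything with the unnormalized rule and marginalizes on the output states. It assumes a single output threshold (the norm), precisely known abatement rates and made-up probabilities; the function names are hypothetical, and the brute-force enumeration is only practical for tiny frames.

```python
from itertools import product


def conjunctive(m1, m2):
    """Unnormalized Dempster's rule on bbas stored as {frozenset: mass}."""
    out = {}
    for B, v in m1.items():
        for C, w in m2.items():
            out[B & C] = out.get(B & C, 0.0) + v * w
    return out


def equivalence_demo(p, q):
    """Mass on each output state after combination and marginalization.

    p: probabilities of modes x_0..x_n; q: probabilities of raw states e_0..e_{n+1}.
    Output state 0 stands for s_0 (non compliant), 1 for s_1 (compliant).
    """
    n = len(p) - 1
    X, E, S = range(n + 1), range(n + 2), (0, 1)
    full = frozenset(product(X, E, S))
    m = {full: 1.0}
    for i in X:
        # Crisp relation B_i: in mode i, raw state e_j gives s_0 iff j <= i.
        b_i = frozenset((i, j, 0 if j <= i else 1) for j in E)
        other_modes = frozenset(t for t in full if t[0] != i)
        m = conjunctive(m, {b_i | other_modes: 1.0})  # ballooning extension
    m_x = {frozenset(t for t in full if t[0] == i): p[i] for i in X}  # vacuous ext.
    m_e = {frozenset(t for t in full if t[1] == j): q[j] for j in E}  # vacuous ext.
    m = conjunctive(conjunctive(m, m_x), m_e)
    m_s = {}
    for A, v in m.items():
        key = frozenset(s for _, _, s in A)  # projection on S
        m_s[key] = m_s.get(key, 0.0) + v
    return m_s
```

With `p = [0.808, 0.192]` and `q = [0.05, 0.15, 0.80]`, the mass given to the non compliant state coincides with the double sum of the classical approach, 0.808 × 0.05 + 0.192 × 0.20 = 0.0788, and all the remaining mass goes to the compliant state.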
5 Simulations

Figure 4 shows simulation results, for one normal mode $x_0$, one failure mode $x_1$,
and seven output concentration thresholds $\nu_k$, $k = 0, \ldots, 6$. The three graphs
correspond to three states of knowledge:

1. In the first case (upper left), the abatement rates, failure rate and repair
time are known:

$\alpha_0 = 0.8$, $\alpha_1 = 0.4$,



Ai = 2 X 10"^ h"^ Ti = 4 X 24 h. 

2. In the second case (upper right), the abatement rates are still known, but
the failure rate and repair time are only bounded by:

$\lambda_1^- = 10^{-3}\ \mathrm{h}^{-1}$, $\lambda_1^+ = 2 \times 10^{-3}\ \mathrm{h}^{-1}$,

$T_1^- = 4 \times 24$ h, $T_1^+ = 8 \times 24$ h.

3. In the third case (lower graph), the same knowledge of $\lambda_1$ and $T_1$ is assumed,
and the abatement rates are constrained by fuzzy numbers:

$\tilde{\alpha}_0 = (0.79, 0.8, 0.81)$, $\tilde{\alpha}_1 = (0.39, 0.4, 0.41)$.

As expected, the induced belief measure on $S$ is a probability measure when
all necessary data are available. In that case, the solution is identical to that of 
the classical fault tree approach. In contrast, results with the TBM approach 
degrade gracefully when the imprecision in input data increases, the pignistic 
probability getting closer to the uniform distribution. 

6 Conclusion 

In this paper, an original method for taking data uncertainties into account
in risk assessment for the drinking water production process has been described.
In order to evaluate the unavailability of compliant water, this method takes into
account resource quality, characteristics of the treatment plant, and the different operating
modes of the treatment plant. Belief functions are used to describe expert knowledge
of treatment step efficiency, failure rates, times to repair and raw water
quality. By combining these belief functions, it is possible to define the belief
of producing compliant water. In the case where all data are precisely known, this
approach was shown to be equivalent to the classical fault tree analysis.

Preliminary validation showed that this approach fits the experts’ needs well,
by making it possible to simulate various process line scenarios in order to propose
the best technical option, based on the available partial information.

Fig. 4. Results obtained in case 1 (upper left), case 2 (upper right) and case 3 (bottom).
The white bars correspond to the belief given to each singleton of $S$. The black lines
indicate the pignistic probabilities, and the full bars (white and grey parts) show the
plausibilities

Abstract. To overcome the frequent criticism of Dempster’s rule for the
combination of belief functions, several alternatives have been defined, the
consensus operator among them. An algebraic analysis of the consensus
operator is presented, using the methodology introduced by Hájek-Valdés
for Dempster’s semigroup. The methodology and Dempster’s semigroup
are recalled. Jøsang’s semigroup and related structures are introduced,
analysed, and compared with those related to Dempster’s case.

Keywords: belief functions, Dempster-Shafer theory, combination
of belief functions, Dempster’s rule, Dempster’s semigroup, consensus
operator, Jøsang’s semigroup, expert systems



1 Introduction 

Ever since the publication of Shafer’s book A Mathematical Theory of Evidence 
[12] there has been continuous controversy around the so-called Dempster’s rule. 
The purpose of Dempster’s rule is to combine two beliefs into a single belief that 
reflects the two beliefs in a fair and equal way. 

Dempster’s rule has been criticised mainly because highly conflicting beliefs 
tend to produce counterintuitive results. This has been formulated in the form 
of examples by Zadeh [15], Cohen [1], and Daniel [2] among others. The problem 
with Dempster’s rule is due to its normalisation which redistributes conflict- 
ing belief masses to non-conflicting ones, and thereby tends to eliminate any 
conflicting characteristics in the resulting belief mass distribution. The rule is
also criticized for not being defined for the combination of ‘totally’
conflicting pieces of evidence. An alternative called the non-normalised Dempster’s
rule, proposed by Smets [13], avoids this particular problem by allocating
all conflicting belief masses to the empty set. The idea is that conflicting belief
masses should be allocated to this missing (empty) event.

Unfortunately, even the non-normalised version does not remove all the disadvantages
of Dempster’s rule. Thus several other alternatives were suggested
later. Among the newest ones is the consensus operator [10,11], which was
developed with the intention of combining highly conflicting beliefs better. The
consensus operator forms part of the subjective logic described by Jøsang in [10].

An algebraic structure of binary belief functions with Dempster’s rule $\oplus$,
called Dempster’s semigroup, has been studied in detail in a series of publications,
e.g. [8,9]. The appearance of the consensus operator, which was recently
developed to overcome the disadvantages of Dempster’s rule, is a motivation
for studying algebraic structures of belief functions with this operator, in order to obtain a
better theoretical comparison of both approaches.

The next section briefly recalls the basic definitions. The algebraic analysis 
of Dempster’s semigroup, which is used as a methodology for the presented 
investigation, is overviewed in the third section. 

Section 4 presents basic ideas and facts about the opinion space, to prepare
for the introduction of the consensus operator in the following section.

In Sect. 6, a new algebraic structure - the algebraic structure of binary belief
functions with the consensus operator - is defined. The new structure, called
Jøsang’s semigroup, is analysed there. The results are discussed and compared
with those of Dempster’s semigroup in Sect. 7.

Finally, some ideas for future research are outlined.



2 Preliminaries 

Let us recall some basic algebraic notions and some basic notions from the
Dempster-Shafer theory before we begin a description of its algebra.

A commutative semigroup (also called an Abelian semigroup) is a structure
$\mathbf{X} = (X, \oplus)$ formed by a set $X$ and a binary operation $\oplus$ on $X$ which is
commutative and associative ($x \oplus y = y \oplus x$ and $x \oplus (y \oplus z) = (x \oplus y) \oplus z$
hold for all $x, y, z \in X$). A commutative group is a structure $\mathbf{X} = (X, \oplus, -, o)$
such that $(X, \oplus)$ is a commutative semigroup, $o$ is a neutral element ($x \oplus o = x$)
and $-$ is a unary operation of inverse ($x \oplus -x = o$). An ordered Abelian
(semi)group consists of a commutative (semi)group $\mathbf{X}$ as above and a linear
ordering $\le$ of its elements satisfying monotonicity ($x \le y$ implies $x \oplus z \le y \oplus z$
for all $x, y, z \in X$). A subset of $X$ which is a (semi)group itself is called a
sub(semi)group. The subsemigroup $(\{x \mid x \ge o, x \in X\}, \oplus, o)$ is called the positive
cone of the ordered Abelian group (OAG) $\mathbf{X}$; the negative cone is defined similarly for $x \le o$.
An ordered semigroup $\mathbf{X} = (X, \oplus, \le)$ is Archimedean if for any $x, y \in X$ there
exists a natural number $n$ such that $y < nx$, where $nx = x \oplus \ldots \oplus x$ ($n$ summands).

For uncertainty processing, we extend an OAG with extremal elements $\top$ and
$\bot$ representing True and False, with $\top \oplus x = \top$, $\bot \oplus x = \bot$, and $\top \oplus \bot$ not defined.¹

¹ Some examples are the OAGs $\mathbf{PP} = ([0, 1], \oplus_{PP}, 1 - x, \frac{1}{2}, \le)$ and
$\mathbf{MC} = ([-1, 1], \oplus_{MC}, -, 0, \le)$ corresponding to the combining structures of the classical
expert systems PROSPECTOR and EMYCIN, see [8], where $x \oplus_{PP} y = \frac{xy}{xy + (1-x)(1-y)}$
and $x \oplus_{MC} y = x + y - xy$ for $x, y \ge 0$, $x + y + xy$ for $x, y \le 0$ and
$\frac{x + y}{1 - \min(|x|, |y|)}$ for $xy < 0$.

A homomorphism $p : (X, \oplus_1) \to (Y, \oplus_2)$ is a mapping which preserves
structure, i.e. $p(x \oplus_1 y) = p(x) \oplus_2 p(y)$ for each $x, y \in X$. Morphisms which also
preserve the ordering of elements are called ordered morphisms, see [7].

Ordered structures and ordered morphisms are very important for a compar- 
ative approach to uncertainty management and decision making. 

Let us consider a two-element frame of discernment $\Theta = \{0, 1\}$. A basic
belief assignment is a mapping $m : \mathcal{P}(\Theta) \to [0, 1]$, such that $\sum_{A \subseteq \Theta} m(A) = 1$ and
$m(\emptyset) = 0$. Each subset $A \subseteq \Theta$ such that $m(A) > 0$ is called a focal element of $m$.
A belief function is a mapping $bel : \mathcal{P}(\Theta) \to [0, 1]$, $bel(A) = \sum_{\emptyset \neq X \subseteq A} m(X)$.
In our special case $bel(1) = m(1)$, $bel(0) = m(0)$, $bel(\{0, 1\}) = m(1) + m(0) +
m(\{0, 1\}) = 1$. Each basic belief assignment determines a d-pair $(m(1), m(0))$
and conversely, each d-pair determines a basic belief assignment.

Dempster’s (conjunctive) rule of combination is given as
$(bel_1 \oplus bel_2)(A) = \sum_{X \cap Y = A} K\, m_1(X)\, m_2(Y)$, where $K = \frac{1}{1 - \sum_{X \cap Y = \emptyset} m_1(X) m_2(Y)}$.
In particular, for $(m_1(1), m_1(0)) = (a, b)$ and $(m_2(1), m_2(0)) = (c, d)$ we have

$$(a, b) \oplus (c, d) = \left(1 - \frac{(1-a)(1-c)}{1 - (ad + bc)},\ 1 - \frac{(1-b)(1-d)}{1 - (ad + bc)}\right).$$
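The closed form for d-pairs can be checked against the general rule on the two-element frame. In this sketch the dictionary-based representation and the function names `dempster`, `dpair_to_bba` and `combine` are assumptions made for illustration; the combination is undefined (division by zero) for totally conflicting inputs.

```python
def dempster(m1, m2):
    """Normalized Dempster's rule for bbas stored as {frozenset: mass}."""
    raw = {}
    for X, v in m1.items():
        for Y, w in m2.items():
            raw[X & Y] = raw.get(X & Y, 0.0) + v * w
    K = 1.0 / (1.0 - raw.pop(frozenset(), 0.0))  # renormalize the conflict away
    return {A: K * v for A, v in raw.items()}


def dpair_to_bba(pair):
    """d-pair (m(1), m(0)) -> bba on the frame {0, 1}."""
    a, b = pair
    return {frozenset({1}): a, frozenset({0}): b, frozenset({0, 1}): 1.0 - a - b}


def combine(p1, p2):
    """Closed form of Dempster's rule on d-pairs."""
    (a, b), (c, d) = p1, p2
    k = 1.0 - (a * d + b * c)
    return (1.0 - (1.0 - a) * (1.0 - c) / k, 1.0 - (1.0 - b) * (1.0 - d) / k)
```

For the d-pairs (0.5, 0.2) and (0.3, 0.4), both computations give (0.39/0.74, 0.26/0.74).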

If all the focal elements are singletons (i.e. one-element subsets of $\Theta$) then we
speak about a Bayesian belief function. A dogmatic belief function is defined by
Smets as a belief function for which $m(\Theta) = 0$. Let us note that trivially every
Bayesian belief function is dogmatic.

A Bayesian transformation is a mapping $t : Bel_\Theta \to Prob_\Theta$, such that
$bel(x) \le t(bel)(x) \le 1 - bel(\bar{x})$. Thus a Bayesian transformation assigns a
Bayesian belief function (i.e. probability function) to every general one. The
fundamental example of a Bayesian transformation is the pignistic transformation
introduced by Smets.



3 On the Dempster’s Semigroup 

Definition 1. A Dempster’s pair (or d-pair) is a pair of reals such that $a, b \ge 0$
and $a + b \le 1$. A d-pair $(a, b)$ is Bayesian if $a + b = 1$, $(a, b)$ is simple if $a = 0$
or $b = 0$; in particular, extremal d-pairs are the pairs $(1, 0)$ and $(0, 1)$. (Definitions
of Bayesian and simple d-pairs correspond evidently to the usual definitions of
Bayesian and simple belief assignments [8,12]).

Definition 2. Dempster’s semigroup² $\mathbf{D}_0 = (D_0, \oplus)$ is the set of all non-extremal
Dempster’s pairs, endowed with the operation $\oplus$ and two distinguished
elements $0 = (0, 0)$ and $0' = (\frac{1}{2}, \frac{1}{2})$, where the operation $\oplus$ is defined by

$$(a, b) \oplus (c, d) = \left(1 - \frac{(1-a)(1-c)}{1 - (ad + bc)},\ 1 - \frac{(1-b)(1-d)}{1 - (ad + bc)}\right).$$

² A generalization of the notion of Dempster’s semigroup is described in [9], see also
[8]. The resulting algebraic structure is called a dempsteroid. It has a similar relation
to Dempster’s semigroup as OAG has to PP or MC.




Algebraic Structures Related to the Consensus Operator 335 



Fig. 1. Dempster’s semigroup. Homomorphism $h$ is in this representation a projection
to group $G$ along the straight lines running through the point $(1, 1)$. All the Dempster’s
pairs lying on the same ellipse are mapped by homomorphism $f$ to the same d-pair in
semigroup $S$

Definition 3. For $(a, b) \in D_0$ we define
$-(a, b) = (b, a)$,
$h(a, b) = (a, b) \oplus 0' = \left(\frac{1-b}{2-a-b}, \frac{1-a}{2-a-b}\right)$,
$h_1(a, b) = \frac{1-b}{2-a-b}$,
$f(a, b) = (a, b) \oplus (b, a) = \left(\frac{a+b-a^2-b^2-ab}{1-a^2-b^2}, \frac{a+b-a^2-b^2-ab}{1-a^2-b^2}\right)$.

For $(a, b), (c, d) \in D_0$ we further define
$(a, b) \le (c, d)$ iff $h_1(a, b) < h_1(c, d)$, or $h_1(a, b) = h_1(c, d)$ and $a \le c$.

Let $G$ denote the set of all Bayesian non-extremal d-pairs. Let us denote the
set of all simple d-pairs such that $b = 0$ ($a = 0$) as $S_1$ ($S_2$). Furthermore, put
$S = \{(a, a) : 0 \le a \le 0.5\}$. (Note: $h(a, b)$ is an abbreviation for $h((a, b))$, etc.)



Theorem 1.

(i) Dempster’s semigroup with the relation $\le$ is an ordered commutative
semigroup with neutral element $0$; $0'$ is the only nonzero idempotent of it.
(ii) The set $G$ with the ordering $\le$ is an ordered Abelian group
$(G, \oplus, -, 0', \le)$ which is isomorphic to the PROSPECTOR group PP
(cf. [8]) and consequently isomorphic to the additive group of reals with
the usual ordering.
(iii) The sets $S$, $S_1$ and $S_2$ with the operation $\oplus$ and the ordering $\le$ form
ordered commutative semigroups with neutral element $0$, and are all isomorphic
to the positive cone of the MYCIN group MC.
(iv) The mapping $h$ is an ordered homomorphism of the ordered Dempster’s
semigroup onto its subgroup $G$ (i.e. onto PP).
(v) The mapping $f$ is a homomorphism of Dempster’s semigroup onto its
subsemigroup $S$ (but it is not an ordered homomorphism).

Using the theorem, see (iv) and (v), we can express³

$(a \oplus b) = h^{-1}(h(a) \oplus h(b)) \cap f^{-1}(f(a) \oplus f(b))$.
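The homomorphism properties (iv) and (v) can be checked numerically on sample d-pairs. The sketch below is only a spot check under floating-point arithmetic, not a proof; the function names `combine`, `neg`, `h` and `f` and the constant `ZERO_PRIME` are labels chosen for this example.

```python
def combine(x, y):
    """Dempster's rule on d-pairs (closed form of Definition 2)."""
    (a, b), (c, d) = x, y
    k = 1.0 - (a * d + b * c)
    return (1.0 - (1.0 - a) * (1.0 - c) / k, 1.0 - (1.0 - b) * (1.0 - d) / k)


ZERO_PRIME = (0.5, 0.5)  # the idempotent 0'


def neg(x):
    """-(a, b) = (b, a)."""
    return (x[1], x[0])


def h(x):
    """h(x) = x combined with 0': maps a d-pair to a Bayesian d-pair."""
    return combine(x, ZERO_PRIME)


def f(x):
    """f(x) = x combined with -x: maps a d-pair to a symmetric pair (s, s)."""
    return combine(x, neg(x))
```

For x = (0.5, 0.2), h(x) is (0.8/1.3, 0.5/1.3), whose components sum to one, and f(x) has two equal components; applying h or f before or after combining two d-pairs gives the same result up to rounding.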

4 The Opinion Space 

Let us briefly recall some notions from [10,11] before the definition of the consensus
operator. Let us consider a binary frame of discernment $\Theta$ again⁴. Let
$\Theta = \{x, \bar{x}\}$, where $x$ (resp. $\bar{x}$) could be a simple element from an application
domain or a subset of an original multidimensional frame of discernment $\Theta_0$ and
$\bar{x} = \Theta_0 - x$. In the latter case, let the belief function on $\Theta$ be constructed by
the method of focusing, see [10,11]. Let us assume a basic belief assignment $m$
such that $m(x) = b$, $m(\bar{x}) = d$, $m(\Theta) = u$. Hence $bel(x) = b$, $bel(\bar{x}) = d$, and we
can consider $b$ as a belief about the truth of $x$, $d$ as a disbelief about $x$ (a belief
about the complement of $x$), and $u = 1 - b - d$ as an uncertainty⁵ about $x$. Let
us further recall a 3-dimensional metric⁶ called opinion⁷.

Definition 4. Let $\Theta$ be a binary frame of discernment containing $x$ and $\bar{x}$ as
its elements, let $m$ be a basic belief assignment which defines the observer’s belief $b$
about $x$, disbelief $d$ about $x$ (a belief in the complement of $x$), and uncertainty $u$
about $x$. Let $a$ represent the relative atomicity of $x$ in $\Theta_0$ if $\Theta$ is obtained by focusing $\Theta_0$ (or
simply in $\Theta$ if $\Theta = \Theta_0$), $a = |x| / |\Theta_0|$ (it is $\frac{1}{2}$ if $\Theta = \Theta_0$). Then the observer’s
opinion about $x$ is the tuple:

$\omega_x = (b, d, u, a)$.

Thus an opinion $\omega_x$ represents an observer’s belief, disbelief and uncertainty
about the truth of $x$ and a relative atomicity $a_x$ of $x$ in the original frame
of discernment $\Theta_0$ in the case of focusing. The opinion contains a redundant
parameter $u = 1 - b - d$ which allows a simple definition of the consensus
operator, see the next section. Because we consider only one $x$, we can omit
the indexing of $b, d, u, a$ by $x$, which is used in the case where focusing given by
different subsets of $\Theta_0$ is considered.

The opinion space can be graphically represented by a triangle as shown in
Fig. 2.

³ Note that in fact $h(x)$ expresses certainty / uncertainty of belief $x$, while $f^{-1}(f(x)) \cap S$
expresses vagueness / preciseness of $x$.

⁴ In [10] and [11] a method is described for focusing belief functions on a general
$\Theta_0$ to belief functions on a focused binary $\Theta$ such that probabilistic expectations
remain the same.

⁵ We use Jøsang’s terminology here. Note that $bel(\bar{x}) = dou(x)$ is called (degree of)
doubt of $x$ by Shafer in [12]. $u = 1 - b - d$ corresponds rather to vagueness than to
uncertainty in Hájek-Valdés.

⁶ From the mathematical point of view, it is not a metric. It is just an extended
representation of a binary belief function (belief).

⁷ Note that upper indices A, B, C, ... are used to distinguish opinions in [10,11].

Fig. 2. Opinion triangle; $\omega_x$ is an example of an opinion about $x \in \Theta$



As an example, the position of the opinion $\omega_x = (0.4, 0.1, 0.5, 0.6)$ is indicated
as a point in the triangle. The horizontal base line between the Belief and
Disbelief corners is called the probability axis. As shown in the figure, the probability
expectation value $E(x) = 0.7$ and the relative atomicity $a(x) = 0.6$ can
be graphically represented as points on the probability axis. The line joining the
top corner of the triangle and the relative atomicity point is called the director.
The projector is parallel to the director and passes through the opinion point
$\omega_x$. Opinions situated on the probability axis are called dogmatic opinions; they
represent traditional probability. The distance between an opinion point and the
probability axis can be interpreted as a degree of uncertainty. Opinions situated
in the left or right corner, i.e. with either $b = 1$ or $d = 1$, are called absolute
opinions; they correspond to the TRUE or FALSE values of two-valued logic.
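The numbers of this example can be checked with a short sketch. It assumes the usual formula of Jøsang's subjective logic for the probability expectation, $E = b + a \cdot u$, which is not spelled out in this section; the function name `expectation` is ours.

```python
from fractions import Fraction as F

def expectation(b, d, u, a):
    """Probability expectation E = b + a*u of an opinion (b, d, u, a).

    Assumption: E = b + a*u, as in Josang's subjective logic.
    """
    if b + d + u != 1:
        raise ValueError("components must satisfy b + d + u = 1")
    return b + a * u

# The opinion (0.4, 0.1, 0.5, 0.6) from the text, written with exact fractions.
omega_x = (F(2, 5), F(1, 10), F(1, 2), F(3, 5))
print(expectation(*omega_x))  # 7/10
```

For a dogmatic opinion ($u = 0$) the expectation reduces to $b$, so the opinion point and its projection on the probability axis coincide.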

Because the relative atomicity does not play any role in the consensus operator
(it is used for computing the probability expectation and by another
operator of Jøsang’s subjective logic), we can omit it as redundant from our
point of view.



4.1 Analogy of Opinions and d-Pairs 

Trivially, any opinion $(b, d, u)$ gives the unique d-pair $(b, d)$, and analogously
any d-pair $(v, w)$ gives the opinion $(v, w, 1 - v - w)$, which is unique if the relative
atomicity is omitted or fixed. We can observe that the absolute opinion $(1, 0, 0)$ in
the right corner of the opinion triangle (Belief) corresponds to $\top = (1, 0)$ in the
notation of Dempster’s semigroup, while Disbelief $(0, 1, 0)$ in the left corner
corresponds to $\bot = (0, 1)$, and Uncertainty $(0, 0, 1)$ in the top corner corresponds
to $0 = (0, 0)$, which is interpreted as total ignorance in Dempster’s semigroup.
Analogously, the probability axis corresponds to the set $G$ of all Bayesian d-pairs,
and the right (or left) arm of the opinion space triangle corresponds to
$S_1$, i.e. to the set of all simple d-pairs $(b, 0)$ (or to $S_2$, respectively). The vertical
median of the opinion triangle connecting $(0, 0, 1)$ and $(\frac{1}{2}, \frac{1}{2}, 0)$ corresponds to
the set $S$. Using these analogies, we will use the denotations $G$, $S$, $S_1$ and $S_2$ also in
the context of the opinion space.

Especially in the case of two simple elements $x$ and $\bar{x}$ of a domain ($\Theta = \Theta_0$, i.e.
$|\Theta_0| = 2$), or in the case where $|x| = |\bar{x}|$ in $\Theta_0$ for $|\Theta_0| > 2$, the relative
atomicity is fixed to $\frac{1}{2}$, all the projectors are perpendicular to the probability axis,
and the probability expectation is equal to the pignistic probability defined in the
Transferable Belief Model [13,14].

5 The Consensus Operator 

The consensus of two opinions is an opinion that reflects both argument opinions
in a fair and equal way, i.e. when two observers have beliefs about the truth of
$x$ resulting from distinct pieces of evidence about $x$, the consensus operator
produces a consensus belief that combines the two separate beliefs into one.

Definition 5. Let $\omega_A = (b_A, d_A, u_A)$ and $\omega_B = (b_B, d_B, u_B)$ be opinions
respectively held by agents $A$ and $B$ about the same element $x$ of $\Theta = \{x, \bar{x}\}$, and
let $\kappa = u_A + u_B - u_A u_B$. When $u_A, u_B \to 0$, the relative dogmatism between
$\omega_A$ and $\omega_B$ is defined by $\gamma$ so that $\gamma = u_A / u_B$. Let $\omega_{AB} = (b_{AB}, d_{AB}, u_{AB})$ be
the opinion such that:

for $\kappa \neq 0$:

1. $b_{AB} = (b_A u_B + b_B u_A)/\kappa$,
2. $d_{AB} = (d_A u_B + d_B u_A)/\kappa$,
3. $u_{AB} = (u_A u_B)/\kappa$;

for $\kappa = 0$:

1. $b_{AB} = (\gamma b_A + b_B)/(\gamma + 1)$,
2. $d_{AB} = (\gamma d_A + d_B)/(\gamma + 1)$,
3. $u_{AB} = 0$.

Then $\omega_{AB}$ is called the consensus opinion between $\omega_A$ and $\omega_B$, representing an
imaginary agent $[A, B]$’s opinion about $x$, as if that agent represented both $A$ and
$B$. By using the symbol $\odot$ to designate this operator, we define $\omega_{AB} = \omega_A \odot \omega_B$.

6 Jøsang’s Semigroup

Let us turn our attention to an algebra of belief functions on a binary frame of
discernment (i.e. to an algebra of d-pairs, or opinions) with the binary consensus
operator $\odot$. As already stated in [10], the consensus operator $\odot$ is a commutative
and associative operation on the set of all non-dogmatic binary belief
functions (opinions), hence we can speak about an Abelian semigroup again.
Associativity of the consensus of several dogmatic beliefs is more complicated, thus we
postpone its discussion and a formal definition of Jøsang’s semigroup.

Let us note that the (from our point of view) redundant relative atomicity and the indexing
by $x$ are omitted in this definition, originally from [11]; further, the upper indices $A$, $B$
are replaced by lower ones.

$\oplus$ is used in [10,11]. Let us use $\odot$ here to distinguish the consensus operator $\odot$ from
Dempster’s rule $\oplus$.
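6.1 An Algebraization of the Consensus Operator

Because we have no more information than the beliefs, i.e. the opinions, and we do not
expect any additional information, we have to consider the same approximation of $u$
to $0$ for all dogmatic opinions. Hence $\gamma = 1$ and we can express the consensus
operator as follows:

$$(b_A, d_A, u_A) \odot (b_B, d_B, u_B) = \left( \frac{b_A u_B + b_B u_A}{u_A + u_B - u_A u_B},\ \frac{d_A u_B + d_B u_A}{u_A + u_B - u_A u_B},\ \frac{u_A u_B}{u_A + u_B - u_A u_B} \right) \quad \text{for } u_A + u_B - u_A u_B \neq 0,$$

$$(b_A, d_A, 0) \odot (b_B, d_B, 0) = \left( \frac{b_A + b_B}{2},\ \frac{d_A + d_B}{2},\ 0 \right)$$

for two dogmatic opinions; the consensus of several dogmatic opinions will be discussed later.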
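The two cases of the operator with $\gamma = 1$ can be sketched in a few lines. This is only an illustration under our own conventions: opinions are plain tuples $(b, d, u)$ of exact fractions, and the function name `consensus` is hypothetical.

```python
from fractions import Fraction as F

def consensus(x, y):
    """Consensus of two opinions x = (bA, dA, uA), y = (bB, dB, uB), gamma = 1."""
    (bA, dA, uA), (bB, dB, uB) = x, y
    kappa = uA + uB - uA * uB
    if kappa != 0:
        # General case: at least one opinion is non-dogmatic.
        return ((bA * uB + bB * uA) / kappa,
                (dA * uB + dB * uA) / kappa,
                (uA * uB) / kappa)
    # Both opinions are dogmatic (uA = uB = 0): componentwise arithmetic mean.
    return ((bA + bB) / 2, (dA + dB) / 2, F(0))

x = (F(2, 5), F(1, 10), F(1, 2))
ignorance = (F(0), F(0), F(1))
print(consensus(x, ignorance) == x)  # True
```

Combining any opinion with the total ignorance $(0, 0, 1)$ leaves it unchanged, while a dogmatic opinion combined with a non-dogmatic one is returned unchanged by the general formula.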



Lemma 1. (i) Both $0 = (0, 0, 1)$ and $0' = (\frac{1}{2}, \frac{1}{2}, 0)$ are idempotents of the
consensus operator.

(ii) All the Bayesian d-pairs (dogmatic opinions) are idempotents with respect
to the consensus operator.

(iii) All the Bayesian d-pairs (dogmatic opinions) are absorbing elements with
respect to the consensus with non-Bayesian ones.

(iv) $0 = (0, 0, 1)$ is the only non-Bayesian idempotent.

(v) $0 = (0, 0, 1)$ is the neutral element for non-Bayesian d-pairs (opinions).



Lemma 2. (i) All the subsets $G$, $S$, $S_1$ and $S_2$ of the opinion space are closed
with respect to the consensus operator.

(ii) The consensus of two opinions is Bayesian iff at least one of the opinions
consensed (i.e. combined by the consensus operator) is Bayesian.

(iii) All the subsets $S_{(k)} = \{(b, kb, 1 - (1 + k)b) \mid (b, kb, 1 - (1 + k)b) \text{ is an opinion}\}$
of the opinion space are closed with respect to the consensus operator.

For proofs of these and the following lemmata see [5]. 

Definition 6. Let us define for $(b, d, u)$ from the opinion space the following:

$-(b, d, u) = (d, b, u)$,

$q(b, d, u) = (b, d, u) \odot 0' = 0'$,

$q_0(b, d, u) = q^{-1}(q(b, d, u)) \cap (S_1 \cup S_2)$, where

qo{b,d,u) = (53^7,0, ^^(^) for b>d, qo(b,d,u) = (^,0, for b < d, 

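$r(b, d, u) = (b, d, u) \odot -(b, d, u) = \left(\frac{b+d}{2-u}, \frac{b+d}{2-u}, \frac{u}{2-u}\right)$ for $u \neq 0$,

$r(b, d, 0) = (b, d, 0) \odot (d, b, 0) = \left(\frac{b+d}{2}, \frac{b+d}{2}, 0\right) = (\frac{1}{2}, \frac{1}{2}, 0) = 0'$.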

For $(b, d, u), (b', d', u') \in D_0$ we further define

$(b, d, u) \leq_\odot (b', d', u')$ iff $(q_0)_2(b, d, u) < (q_0)_2(b', d', u')$, or if $(q_0)_2(b, d, u) =
(q_0)_2(b', d', u')$ and $(q_0)_1(b, d, u) < (q_0)_1(b', d', u')$, or if $q_0(b, d, u) = q_0(b', d', u')$
and $b \leq b'$, where $q_0(b, d, u) = ((q_0)_1(b, d, u), (q_0)_2(b, d, u), (q_0)_3(b, d, u))$.
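The mapping $r$ can be explored numerically. The sketch below is an illustration only: it computes $r(x) = x \odot -x$ for non-dogmatic opinions, compares it with the closed form $\left(\frac{b+d}{2-u}, \frac{b+d}{2-u}, \frac{u}{2-u}\right)$ obtained by substituting into the consensus formula, and checks on one pair of sample opinions that $r$ commutes with the consensus. All function names are ours.

```python
from fractions import Fraction as F

def consensus(x, y):
    """Consensus of two non-dogmatic opinions (b, d, u)."""
    (bA, dA, uA), (bB, dB, uB) = x, y
    kappa = uA + uB - uA * uB
    return ((bA * uB + bB * uA) / kappa,
            (dA * uB + dB * uA) / kappa,
            (uA * uB) / kappa)

def minus(x):
    """Unary minus: swap belief and disbelief."""
    b, d, u = x
    return (d, b, u)

def r(x):
    """r(x) = x consensed with -x."""
    return consensus(x, minus(x))

def r_closed(x):
    """Closed form of r for u != 0."""
    b, d, u = x
    return ((b + d) / (2 - u), (b + d) / (2 - u), u / (2 - u))

x = (F(2, 5), F(1, 10), F(1, 2))
y = (F(1, 5), F(1, 2), F(3, 10))
print(r(x))  # (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))
print(r(consensus(x, y)) == consensus(r(x), r(y)))  # True
```

The image of $r$ always has equal belief and disbelief, i.e. it lies on the vertical median $S$ of the opinion triangle.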



Lemma 3. (i) $-(-x) = x$ (i.e. $-(-(b, d, u)) = (b, d, u)$),

(ii) $-(x \odot y) = -x \odot -y$ (i.e. $-((b_1, d_1, u_1) \odot (b_2, d_2, u_2)) = -(b_1, d_1, u_1) \odot -(b_2, d_2, u_2)$),

Including extremal d-pairs (absolute opinions) TRUE and FALSE. 

(iii) $-x$ is not an inverse to $x$, i.e. the equation $(b_1, d_1, u_1) \odot (b_2, d_2, u_2) =
(0, 0, 1)$ has no solution in the opinion space for $(b_1, d_1, u_1) \neq (0, 0, 1)$.

(iv) The mapping $q$ is a trivial ordered homomorphism of the set of all non-Bayesian
opinions to $\{0'\}$.

(v) The mapping $q_0$ is an ordered homomorphism of $(D_1 - G)$ onto $S_1$ and of
$(D_2 - G)$ onto $S_2$, where $D_1 = \{(b, d, u) \in D_0 \mid b \geq d\}$, $D_2 = \{(b, d, u) \in
D_0 \mid b \leq d\}$, but it is not a homomorphism of $(D_0 - G)$ onto $S_1 \cup S_2$.

(vi) The mapping $r$ is a homomorphism of the whole opinion space onto its
subalgebra $S$ (but it is not an ordered homomorphism).

(vii) The sets $S$, $S_1$, $S_2$ and $S_{(k)}$ with the consensus operator and with the ordering
$\leq_\odot$ form Archimedean OAGs with neutral element $(0, 0, 1)$. They are all
isomorphic to the positive cone of the MYCIN group MC.

(viii) There is no neutral element in $G$ and there is no inverse on $G$, i.e. there is
no relation of $G$ to any group.

(ix) The set $S_0 = S_1 \cup S_2$ with the operator $\odot_{S_0} = \odot \circ q_0$, with the operator $-$, with the
distinguished element $0 = (0, 0, 1)$ and with the ordering $\leq_\odot$ forms an Archimedean
OAG $\mathbf{S}_0 = (S_0, \odot_{S_0}, -, 0, \leq_\odot)$. $\mathbf{S}_0$ is isomorphic to the MYCIN group
MC.

6.2 Associativity of Computing of the Consensus of Bayesian 
Opinions 

The consensus of non-Bayesian opinions is computed as an increased
weighted mean. Both the belief and the disbelief components are weighted by $u$ of the
other opinion, and the resulting mean is increased by the factor $\frac{u_1 + u_2}{u_1 + u_2 - u_1 u_2} > 1$,
i.e. we can express the consensus of non-Bayesian opinions as

$$(b_1, d_1, u_1) \odot (b_2, d_2, u_2) = \left( \frac{b_1 u_2 + b_2 u_1}{u_1 + u_2} \cdot \frac{u_1 + u_2}{u_1 + u_2 - u_1 u_2},\ \frac{d_1 u_2 + d_2 u_1}{u_1 + u_2} \cdot \frac{u_1 + u_2}{u_1 + u_2 - u_1 u_2},\ \frac{u_1 u_2}{u_1 + u_2 - u_1 u_2} \right)$$

for $u_1 u_2 \neq 0$. In contrast, the consensus of Bayesian opinions without any additional
information corresponds just to the non-associative arithmetic mean. To overcome
this, additional tools requiring additional information are used to obtain $\gamma \neq 1$,
see Definition 5. If there is no additional information, we have to distinguish
whether the opinions to be combined are ’single’, i.e. not results of any previous
applications of the consensus operator, or how many times the consensus has
already been used. We use $\gamma = 1$ for two ’single’ opinions, and $\gamma = n$ in the case where
$(b_1, d_1, 0)$ is a result of just $n$ applications of the consensus operator and $(b_2, d_2, 0)$
is a ’single’ one. In the case where the first argument $(b_1, d_1, 0)$ is ’single’ and
the second one is already consensed, we use $\gamma = \frac{1}{n}$. Hence, the computation of
the consensus corresponds to a stepwise computation of the $n$-ary arithmetic mean.
For an example of an associative combination of three Bayesian opinions see [6].
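The effect of tracking the history can be illustrated with a small sketch. It is not the construction of [6]; the bookkeeping is our own assumption: every dogmatic opinion carries a counter of how many ’single’ opinions it already aggregates, and the relative weight of the two arguments is the ratio of these counters. The names `mean2` and `merge` are hypothetical.

```python
from fractions import Fraction as F

def mean2(x, y):
    """Consensus of two dogmatic opinions with gamma = 1 (plain mean)."""
    return tuple((p + q) / 2 for p, q in zip(x, y))

def merge(x, y):
    """History-aware consensus of (opinion, count) pairs.

    Assumption: count = number of 'single' opinions already aggregated.
    """
    (ox, nx), (oy, ny) = x, y
    op = tuple((nx * p + ny * q) / (nx + ny) for p, q in zip(ox, oy))
    return (op, nx + ny)

a = (F(1), F(0), F(0))
b = (F(1, 2), F(1, 2), F(0))
c = (F(0), F(1), F(0))

# Plain pairwise means depend on the bracketing:
print(mean2(mean2(a, b), c) == mean2(a, mean2(b, c)))  # False

# With counters, both bracketings give the ternary arithmetic mean:
left = merge(merge((a, 1), (b, 1)), (c, 1))
right = merge((a, 1), merge((b, 1), (c, 1)))
print(left == right)  # True
```

The price of associativity is visible in the signature of `merge`: the operation no longer acts on opinions alone but on opinions paired with their history.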

Using the above procedure, we are able to compute the consensus of several
Bayesian opinions in an associative way. But this method is not general: we
always have to remember and handle the history of the opinions (how many times the
consensus has been used), and this is not always easy. In the case of subjective
opinions it is often quite difficult, even for the agent himself, to decide
whether his opinion is ’single’ or is already implicitly consensed (i.e. implicitly
combined by the consensus operator) from two or several opinions.



6.3 A Formal Definition of Jøsang’s Semigroup

From the algebraic point of view we have obtained, instead of the operator
on opinions, a new one defined on the Cartesian product of the set of opinions
with the set of positive integers, or reals if we admit non-integer $\gamma$ based on
different additional information. Because even this method is not completely
general, we do not include it in the formal definition of Jøsang’s semigroup, and
we restrict ourselves to non-Bayesian opinions.



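Definition 7. Jøsang’s semigroup $\mathbf{J}_0 = (J_0, \odot)$ is the set of all non-Bayesian
Dempster’s pairs (opinions), endowed with the operation $\odot$ and with a distinguished
element $0 = (0, 0, 1)$, where the operation $\odot$ is defined by

$$(b_A, d_A, u_A) \odot (b_B, d_B, u_B) = \left( \frac{b_A u_B + b_B u_A}{u_A + u_B - u_A u_B},\ \frac{d_A u_B + d_B u_A}{u_A + u_B - u_A u_B},\ \frac{u_A u_B}{u_A + u_B - u_A u_B} \right).$$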



Theorem 2.

(i) Jøsang’s semigroup with the relation $\leq_\odot$ is an ordered commutative
semigroup with neutral element $0 = (0, 0, 1)$; $0$ is its only idempotent.

(ii) The sets $S$, $S_1$, $S_2$ and $S_{(k)}$ with the operation $\odot$ and the ordering $\leq_\odot$ form
Archimedean ordered commutative semigroups with neutral element $0$, and
they are all isomorphic to the semigroup of nonnegative elements (positive
cone) of the MYCIN group MC.

(iii) The set $S_0 = S_1 \cup S_2$ with the operations $\odot_{S_0} = \odot \circ q_0$ and $-$, and
with the ordering $\leq_\odot$, forms an Archimedean OAG with neutral element $0$:
$\mathbf{S}_0 = (S_0, \odot_{S_0}, -, 0, \leq_\odot)$. $\mathbf{S}_0$ is isomorphic to the MYCIN group MC.

(iv) The mapping $q_0$ is an ordered homomorphism of Jøsang’s semigroup onto the
group $\mathbf{S}_0$; it preserves the ordering $\leq_\odot$ ($S_0$ is a subset of $J_0$ but not a
subalgebra of $\mathbf{J}_0$).

(v) The mapping $r$ is a homomorphism of Jøsang’s semigroup onto its
subsemigroup $S$ (but it is not an ordered homomorphism). For any $k > 0$,
$r$ is an ordered isomorphism of $S_{(k)}$ onto $S$.

(vi) The mapping $q$ is a trivial homomorphic Bayesian transformation of $\mathbf{J}_0$.
No non-trivial homomorphic Bayesian transformation of $\mathbf{J}_0$ exists. No
Bayesian transformation of the whole opinion space which is
homomorphic with respect to the consensus operator exists.

For proofs see [5]. 

Using the theorem, see (iv) and (v), we can express the consensus $\odot$ for every
couple of non-Bayesian opinions $x$, $y$ as:

$$(x \odot y) = q_0^{-1}(q_0(x) \odot q_0(y)) \cap r^{-1}(r(x) \odot r(y)).$$
7 A Comparison of Jøsang’s Semigroup with
Dempster’s One

Both algebraic structures have the following similarities:

Both of them are ordered Abelian semigroups with neutral element $(0, 0, 1)$.
They have the same unary operation minus $-$, which is not an inverse in either
case. Both structures have subsemigroups $S$, $S_1$, $S_2$ with neutral elements.
Both of them have a surjective (i.e. onto) homomorphism $D_0 \to S$. We can
define the group $\mathbf{S}_0$ on the subsets $S_0 = S_1 \cup S_2$ of both structures (with $\oplus \circ h_0$, $-$,
and $\leq_\oplus$ in the case of $\mathbf{D}_0$, and with $\odot \circ q_0$, $-$, and $\leq_\odot$ in the case of $\mathbf{J}_0$). In
both cases there exists a surjective ordered homomorphism onto the group $\mathbf{S}_0$.
Both operations $\oplus$ and $\odot$ are expressible using a pair of homomorphisms.

Differences:

Dempster’s semigroup is defined on all non-extremal d-pairs, as $\oplus$ is not defined
for $\top \oplus \bot$, while Jøsang’s semigroup is defined on non-Bayesian d-pairs only,
i.e. on $D_0 - G$. On the other hand, the consensus operator $\odot$ is defined on the
whole extended $D_0$, but it is necessary to use additional information to obtain
its associativity. $\mathbf{S}_0^{\oplus} = (S_0, \oplus_{S_0}, -, 0, \leq_\oplus)$ is isomorphic to $\mathbf{G} = (G, \oplus, -, 0', \leq_\oplus)$,
while $\mathbf{S}_0^{\odot} = (S_0, \odot_{S_0}, -, 0, \leq_\odot)$ collapses to $\{0'\}$.

$\oplus$ forms an Archimedean OAG on $G$, while the behaviour of $\odot$ is completely
different on $G$: $\odot$ is not associative on the set $G$, $0' = (\frac{1}{2}, \frac{1}{2})$ is not a neutral
element, all Bayesian opinions are idempotent and absorbing with respect
to non-Bayesian ones, and extremal elements (absolute opinions) are not absorbing
with respect to Bayesian ones.

We have to remember a different interpretation of uncertainty here. In
Dempster’s semigroup the certainty / uncertainty of a d-pair $x$ is defined as $h(x)$;
in particular, for a Bayesian d-pair $y = (b, d, 0)$ the value $b$ is just the certainty /
uncertainty of $y$. The value $f(x)$ corresponds to the vagueness / impreciseness of $x$.
The Bayesian d-pairs are precise, while $0 = (0, 0, 1)$ is the vaguest d-pair. This
also corresponds to the general consideration of probability as a tool for uncertainty
processing.

In the opinion space interpretation, Bayesian opinions have no uncertainty, 
they are considered to be certain. And uncertainty increases with the distance 
from the set of Bayesian opinions. 

The principle is the following:

The $\oplus$ combination of any two elements (d-pairs / opinions) is on an ellipse further
from $0$ (closer to $G$), and similarly, the $\odot$ combination of any two elements is on a
straight line ($r$-line) further from $0$. That is, the measure $u = 1 - b - d$ is decreased
by the combination, regardless of its interpretation (vagueness / uncertainty).
Both combinations $\oplus$ and $\odot$ of two elements $>_\oplus 0'$ ($>_\odot 0'$ respectively) or of two
elements $<_\oplus 0$ ($<_\odot 0$) are on homomorphic straight lines ($h$-lines, $q$-lines) further
from $S$. In the case of Dempster’s semigroup, we can interpret this so that the
big values (close to $(1, 0, 0)$, d-pairs $>_\oplus 0'$) are increased (closer to $(1, 0, 0)$), while
the small values are decreased (closer to $(0, 1, 0)$). This is caused by the cumulative
nature of Dempster’s rule $\oplus$. There is no such interpretation in the case
of Jøsang’s semigroup, which is caused by the averaging nature of the consensus
operator $\odot$.

8 Conclusions and Perspectives 

A new algebraic structure, Jøsang’s semigroup, is defined on a binary frame of
discernment. Jøsang’s semigroup and related structures are analysed in this
text and compared with the analogously constructed Dempster’s semigroup.

The analysis of the algebraic nature of the consensus operator leads to a
better and deeper understanding of this operator, and of
combining several beliefs in general.

The main theoretical disadvantage of the present state of the consensus operator
is its non-associativity on dogmatic beliefs. This problem has already been
partially solved by using additional information; see the example of an associative
consensus of three dogmatic beliefs in [6]. On the other hand, a theoretically
clean associative consensus of several dogmatic beliefs is still an interesting open
problem.

Another interesting topic for future research is a comparison of the focusing
of a frame of discernment introduced by Jøsang, see [10,11], with the approach
of refinement / coarsening of a frame of discernment which has been suggested
in [3] and used in [4].
Abstract. Twenty years after their inception, intuitionistic fuzzy sets
are on the rise towards making their “claim to fame”. Competing alongside
various other, often closely related, formalisms, they are catering
to the needs of a more demanding and rapidly expanding knowledge- 
based systems industry. In this paper, we develop the notion of a graded 
inclusion indicator within this setting, drawing inspiration from related 
concepts in fuzzy set theory, yet keeping a keen eye on those particular 
challenges raised specifically by intuitionistic fuzzy set theory. The use of 
our work is demonstrated by its applications in approximate reasoning 
and non-probabilistic entropy calculation. 



1 Introduction and Problem Definition 

1.1 Putting Intuitionistic Fuzzy Set Theory on the Map 

IFS theory basically enriches Zadeh’s fuzzy set theory with a notion of inde- 
terminacy expressing hesitation or abstention. While in the latter, membership 
degrees, identifying the degree to which an object satisfies a given property (gen- 
erally speaking), are taken to be exact, in the former extra information in the 
guise of a non-membership degree is permitted to address a commonplace feature 
of uncertainty. Imagine, for instance, a voting procedure in which delegates have 
to express their feelings w.r.t. a number of proposals. It is obvious that while 
one can be in favour or in disfavour of a proposal to a certain extent, one can 
also abstain from the vote; an attitude inspired by, e.g., a lack of background 
or interest, or simply because no obvious arguments for or against the cause at 
stake have been raised. In such a situation, using only a $[0, 1]$-valued degree $\alpha$
expressing support for the proposal is arguably too committing. A similar argument
can be set up when the opinion of a given voter is not (fully) known, and
we should be duly hesitant to classify him as a supporter or an opponent of the 
proposal. 

IFS theory allows for an easy, yet elegant, way out of such problems by not
insisting that membership and non-membership to a set be strictly complementary
properties. In an IFS $A$ defined in a universe¹ $X$, alongside a membership
degree $\mu_A(x)$ of $x$ to $A$, we also distinguish a non-membership degree $\nu_A(x)$,
such that $\mu_A(x) + \nu_A(x) \leq 1$. Note that a fuzzy set in $X$ is then just an IFS for
which $\mu_A(x) + \nu_A(x) = 1$ holds for every $x$. The degree $\pi_A(x) = 1 - \mu_A(x) - \nu_A(x)$
quantifies the degree of indeterminacy associated with $x$ and $A$.

¹ For simplicity, throughout this paper $X$ is assumed to be finite.
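As a minimal sketch of these definitions, an IFS over a finite universe can be stored as a dictionary mapping each element to its pair (membership, non-membership). The representation and the names `indeterminacy` and `is_fuzzy_set` are our own choices for illustration.

```python
from fractions import Fraction as F

def indeterminacy(mu, nu):
    """Degree of indeterminacy pi = 1 - mu - nu of a pair (mu, nu)."""
    if mu < 0 or nu < 0 or mu + nu > 1:
        raise ValueError("need mu, nu >= 0 and mu + nu <= 1")
    return 1 - mu - nu

def is_fuzzy_set(A):
    """An IFS is an ordinary fuzzy set iff mu + nu = 1 everywhere."""
    return all(indeterminacy(mu, nu) == 0 for mu, nu in A.values())

A = {"x1": (F(1, 2), F(3, 10)), "x2": (F(7, 10), F(3, 10))}
print(indeterminacy(*A["x1"]))  # 1/5
print(is_fuzzy_set(A))          # False
```

In the voting picture, `A["x1"]` describes a delegate who supports the proposal to degree 1/2, opposes it to degree 3/10, and abstains to the remaining degree 1/5.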

Just like the relationship between classical logic and set theory was exploited 
in fuzzy set theory to define “fuzzy logics” (in a narrow sense), so we may also 
introduce a notion of “intuitionistic fuzzy (IF) logics”; with a proposition $P$ a
degree of truth $\mu_P$ and one of falsity $\nu_P$ may be associated, such that $\mu_P + \nu_P \leq 1$.

This idea is elaborated in e.g. [1]. 

As it turns out, IFSs pop up quite naturally. Attempts to embed IFS theory 
within more “familiar” frameworks have shown that they fit in with, and enrich, 
a well-established tradition of modeling imprecision rather than setting off on 
an entirely new course, which marks their relevance. It can easily be seen, for 
instance, that IFSs are formally equivalent to interval-valued fuzzy sets: indeed, 
a couple $(\mu_A(x), \nu_A(x))$ may be mapped bijectively onto an interval $[\mu_A(x), 1 - \nu_A(x)]$.
Some would consider this syntactical equivalence sufficient evidence to
dismiss IFS theory as superfluous and giving cause to unnecessary confusion. We 
raise two arguments against such allegations: 

1. Interval-valued fuzzy set theory is currently associated, de facto, with the
work of Mendel and others on type-2 fuzzy logic systems (see e.g. [13]).
That setting is characterized by a probability-like treatment of uncertainty
on the membership degrees in a fuzzy set; an interval-valued fuzzy set is
designated as a special type-2 fuzzy set² that exhibits a uniform spread
of uncertainty on the membership degrees. IFS theory, however, does not
make any assumptions on the nature of its indeterminacy; it merely gives
a quantitative representation of “missing information”.

2. We consider IFS theory as a stepping stone in a larger context that is specifically
tuned to the concept of positive and negative constituents, rather than
lower and upper approximations. Indeed, if we relax the constraint that
$\mu_A(x)$ and $\nu_A(x)$ sum up to at most 1, letting either degree range freely in
$[0, 1]$, we arrive in the realm of fuzzy four-valued logics first suggested by
Stickel [15] and given a nice practical application by Fortemps and Słowiński
[9], who used degrees $(\alpha, \beta) \in [0, 1]^2$ whose respective components express
positive and negative evidence in a preference setting. It is clear that
as soon as $\alpha + \beta > 1$, evidence is inconsistent to some extent. In that sense,
IFS theory can be seen as the consistent restriction of the fuzzy four-valued
framework.

Unfortunately, also a lot of misunderstandings concerning terminology have 
sprung up. The term “intuitionistic” is to be read in a “broad” sense here, 
alluding loosely to the denial of the law of the excluded middle on element 
level (since $\mu_A(x) + \nu_A(x) < 1$ is possible). A “narrow”, graded extension of
intuitionistic logic proper has also been proposed and is due to Takeuti and 
Titani [17] - it bears no relationship to our notion of IFS theory. 

² i.e. a fuzzy set whose membership degrees are themselves fuzzy sets in $[0, 1]$
1.2 An Introduction to Graded Inclusion Measures

In fuzzy set theory, inclusion is, by default, defined as follows: for $A$ and $B$ fuzzy
sets³ in a universe $X$, $A \subseteq B \iff (\forall x \in X)(A(x) \leq B(x))$, i.e. $A \subseteq B$ if and
only if the graph of $A$ fits beneath the graph of $B$. A natural extension of this
definition to IFS theory reads, for $A$ and $B$ IFSs in $X$: $A \subseteq B \iff \mu_A \subseteq \mu_B$
and $\nu_B \subseteq \nu_A$.
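Both two-valued inclusions translate directly into code. In this sketch a fuzzy set is a dictionary of membership degrees and an IFS is a dictionary of (membership, non-membership) pairs over the same finite universe; these representations and the function names are assumptions made for the illustration.

```python
from fractions import Fraction as F

def fuzzy_subset(A, B):
    """A is included in B iff A(x) <= B(x) for every x."""
    return all(A[x] <= B[x] for x in A)

def ifs_subset(A, B):
    """IFS inclusion: mu_A <= mu_B and nu_B <= nu_A pointwise."""
    return all(A[x][0] <= B[x][0] and B[x][1] <= A[x][1] for x in A)

A = {"x1": (F(1, 5), F(1, 2)), "x2": (F(0), F(1))}
B = {"x1": (F(1, 2), F(3, 10)), "x2": (F(1, 10), F(1, 2))}
print(ifs_subset(A, B))  # True
print(ifs_subset(B, A))  # False
```

The answer is always all-or-nothing: a single element violating one of the two inequalities makes the whole test fail, which is the restrictiveness that graded inclusion measures aim to relax.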

While in many theoretical and practical settings this two-valued character- 
ization of subsethood suffices, it could be argued that the definition is overly 
restrictive: just as an element can belong to a fuzzy set to varying degrees, so we 
may also want to talk about a fuzzy set being “more or less” a subset of another 
one. Many researchers [2,7,8,11,12,14,18] have tried to capture this intuition by 
proposing concrete operators Inc that take a couple of fuzzy sets {A, B) as their 
input and return a value Inc{A, B) in [0, 1] indicating the degree of subsethood 
of A to B. 

Typically, to define fuzzy subsethood one takes a definition of classical set 
inclusion and tries to extend (“fuzzify”) it to apply to fuzzy sets. Below we quote 
three distinct, but essentially equivalent⁴, definitions of the inclusion of $A$ into
$B$, where $A, B \in \mathcal{P}(X)$:



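$$A \subseteq B \iff (\forall x \in X)(x \in A \Rightarrow x \in B), \tag{1}$$

$$\iff A = \emptyset \ \text{ or } \ \frac{|A \cap B|}{|A|} = 1, \tag{2}$$

$$\iff \frac{|co(A) \cup B|}{|X|} = 1. \tag{3}$$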



While (1) is stated in strictly logical terms, the other two are based on count- 
ing the elements of a set, i.e. on cardinality, and have a probabilistic (i.e. fre- 
quentist) touch about them. It is therefore not surprising that their respective 
generalizations to fuzzy set theory cease to be equivalent. Without going into 
the details at this point, we might roughly state that adepts of the different crisp 
definitions have put fuzzy subsethood on two separate tracks, one logic-based, 
the other frequency-based. One situation where this distinction comes to light is 
when one tries to mould fuzzy inclusion measures into axiomatic characteriza- 
tions by listing desirable properties for them, as several authors have attempted. 
The most strident dissonance (see e.g. Young [18] on this) seems to concern the 
condition 

$$A, B \in \mathcal{P}(X) \Rightarrow Inc(A, B) \in \{0, 1\} \tag{4}$$

called heritage by Kitainik [11]. As will be revealed later on, choosing to impose 
it pretty much forces us into the logic-based approach, although useful trade-offs 
are possible. 
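For crisp sets the three characterizations of inclusion can be compared exhaustively on a small universe. The sketch reads (2) as "$A$ is empty or $|A \cap B| = |A|$" and (3) as "$|co(A) \cup B| = |X|$"; the function names are hypothetical.

```python
from itertools import chain, combinations

def incl_logical(A, B, X):
    """(1): every element of A is an element of B."""
    return all((x not in A) or (x in B) for x in X)

def incl_ratio(A, B, X):
    """(2): A is empty, or |A & B| / |A| = 1."""
    return len(A) == 0 or len(A & B) == len(A)

def incl_cover(A, B, X):
    """(3): |co(A) | B| / |X| = 1, with co(A) = X - A."""
    return len((X - A) | B) == len(X)

def subsets(X):
    """All subsets of X as sets."""
    items = sorted(X)
    return [set(c) for c in chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))]

X = {1, 2, 3}
agree = all(
    incl_logical(A, B, X) == incl_ratio(A, B, X) == incl_cover(A, B, X)
    for A in subsets(X) for B in subsets(X)
)
print(agree)  # True
```

The agreement is specific to crisp sets: once membership becomes graded, the logical reading and the two counting readings generalize in different ways.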



³ For simplicity, we identify a fuzzy set $A$ with its membership function $\mu_A$ and write
$A(x)$ to denote $\mu_A(x)$.

⁴ Arguably, (1) is more general since it can also deal with infinite sets.
In this paper, we are going to extend this discussion to the framework of
IFS theory. Our aim is twofold: first we are going to try and convey as complete 
and uniform as possible a picture of IF inclusion by contrasting and generalizing 
corresponding fuzzy approaches; and secondly, we will highlight a few distin- 
guishing features that are specific only to the extension, and which are meant to 
refute the criticism that IF subsethood assessment merely amounts to applying 
fuzzy inclusion measures twice. 

The paper is organized as follows: in Sect. 2, we recall the necessary mathe- 
matical background on IFS theory. Section 3 starts by investigating what an IF 
inclusion measure should look like, and what properties it should ideally satisfy. 
This results in the development of logic- and frequency-based approaches. In 
Sect. 4, we briefly sketch the application of these measures in two concrete do- 
mains: approximate reasoning and entropy measurement. Finally, Sect. 5 offers 
a brief conclusion. 

2 Preliminaries of Intuitionistic Fuzzy Set Theory 

Atanassov [1] gives the following definition of an IFS A in X: 

$$A = \{(x, \mu_A(x), \nu_A(x)) \mid x \in X\} \tag{5}$$

where $\mu_A$ and $\nu_A$ are called the membership and non-membership function of $A$,
respectively. They satisfy $\mu_A(x) + \nu_A(x) \leq 1$ for every $x \in X$. The class of all
IFSs in $X$ is denoted $\mathcal{IF}(X)$.

This definition is easy to absorb for humans but lacks mathematical conciseness.
Just as a fuzzy set in $X$ can be interpreted as a mapping from $X$
to $[0, 1]$, so we may define an IFS $A$ in $X$ as a mapping from $X$ to the set
$L^* = \{(x_1, x_2) \in [0, 1]^2 \mid x_1 + x_2 \leq 1\}$. Moreover, equipping $L^*$ with an ordering
$\leq_{L^*}$ defined as $(x_1, x_2) \leq_{L^*} (y_1, y_2) \iff x_1 \leq y_1$ and $x_2 \geq y_2$, $(L^*, \leq_{L^*})$ assumes
the structure of a complete, bounded lattice with greatest element $1_{L^*} = (1, 0)$
and smallest element $0_{L^*} = (0, 1)$. The sup and inf operations on this lattice are
derived from $\leq_{L^*}$ as:

$$\sup((x_1, y_1), (x_2, y_2)) = (\max(x_1, x_2), \min(y_1, y_2)) \tag{6}$$

$$\inf((x_1, y_1), (x_2, y_2)) = (\min(x_1, x_2), \max(y_1, y_2)) \tag{7}$$
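The lattice operations (6) and (7) are easy to experiment with. In this sketch elements of $L^*$ are tuples of exact fractions; the helper names `in_lstar`, `leq`, `sup` and `inf` are ours.

```python
from fractions import Fraction as F

def in_lstar(x):
    """Membership test for L*: both components in [0, 1], sum at most 1."""
    return x[0] >= 0 and x[1] >= 0 and x[0] + x[1] <= 1

def leq(x, y):
    """x <=_{L*} y iff x1 <= y1 and x2 >= y2."""
    return x[0] <= y[0] and x[1] >= y[1]

def sup(x, y):
    """Least upper bound, equation (6)."""
    return (max(x[0], y[0]), min(x[1], y[1]))

def inf(x, y):
    """Greatest lower bound, equation (7)."""
    return (min(x[0], y[0]), max(x[1], y[1]))

x = (F(1, 10), F(3, 10))
y = (F(1, 5), F(2, 5))
print(leq(x, y), leq(y, x))            # False False
print(inf(x, y) == (F(1, 10), F(2, 5)))  # True
```

The pair used here is incomparable under the ordering, so the infimum is a third element of $L^*$ rather than one of the two arguments; this is the main structural difference from the unit interval.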

The intersection, union and complement of IFSs $A$ and $B$ in $\mathcal{IF}(X)$ are
defined by, for $x \in X$, $A \cap B(x) = \inf(A(x), B(x))$, $A \cup B(x) =
\sup(A(x), B(x))$, $co(A)(x) = (\nu_A(x), \mu_A(x))$. Thus, IFSs are a special case of L-fuzzy
sets in the sense of Goguen [10], with $L = L^*$. As a shorthand notation, for
$x \in L^*$, we denote its first, resp. second component by $x_1$ and $x_2$. A special
subset $D$ of “fuzzy values” of $L^*$ is defined by $D = \{(x_1, x_2) \in L^* \mid x_1 = 1 - x_2\}$.

Since $\leq_{L^*}$ is a partial order, an order-theoretic extension of classical negation,
conjunction, disjunction and implication on $L^*$, as negators, triangular norms
and conorms, and implicators, respectively, arises quite naturally: a negator on
$L^*$ is any decreasing $L^* \to L^*$ mapping $\mathcal{N}$ that satisfies $\mathcal{N}(0_{L^*}) = 1_{L^*}$ and
$\mathcal{N}(1_{L^*}) = 0_{L^*}$. The mapping $\mathcal{N}_s$, defined as $\mathcal{N}_s(x_1, x_2) = (x_2, x_1)$, $\forall (x_1, x_2) \in
L^*$, will be called the standard negator.

A t-norm on $L^*$ is any increasing, commutative, associative $(L^*)^2 \to L^*$
mapping $\mathcal{T}$ that satisfies $\mathcal{T}(1_{L^*}, x) = x$, for all $x \in L^*$; a t-conorm on $L^*$
is any increasing, commutative, associative $(L^*)^2 \to L^*$ mapping $\mathcal{S}$ satisfying
$\mathcal{S}(0_{L^*}, x) = x$, for all $x \in L^*$. Obviously, the greatest t-norm with respect to
the ordering $\leq_{L^*}$ is inf, while the smallest t-conorm w.r.t. $\leq_{L^*}$ is sup. Note that
it does not hold that for all $x, y \in L^*$, either $\inf(x, y) = x$ or $\inf(x, y) = y$.
For instance, $\inf((0.1, 0.3), (0.2, 0.4)) = (0.1, 0.4)$. t-norms and t-conorms can
be partitioned into two classes by the following definition: a t-norm $\mathcal{T}$ on $L^*$
(resp. t-conorm $\mathcal{S}$) is called t-representable if there exists a t-norm $T$ and a
t-conorm $S$ on $[0, 1]$ (resp. a t-conorm $S'$ and a t-norm $T'$ on $[0, 1]$) such that,
for $x = (x_1, x_2), y = (y_1, y_2) \in L^*$, $\mathcal{T}(x, y) = (T(x_1, y_1), S(x_2, y_2))$, $\mathcal{S}(x, y) =
(S'(x_1, y_1), T'(x_2, y_2))$; $T$ and $S$ (resp. $S'$ and $T'$) are called the representants
of $\mathcal{T}$ (resp. $\mathcal{S}$). Clearly, inf and sup are t-representable. The following mappings
$\mathcal{T}_W$ and $\mathcal{S}_W$, called the IF Łukasiewicz t-norm and t-conorm, are not [4]:

$$\mathcal{T}_W(x, y) = (\max(0, x_1 + y_1 - 1), \min(1, x_2 + 1 - y_1, y_2 + 1 - x_1)) \tag{8}$$

$$\mathcal{S}_W(x, y) = (\min(1, x_1 + 1 - y_2, y_1 + 1 - x_2), \max(0, x_2 + y_2 - 1)) \tag{9}$$

It can be verified that $\mathcal{T}_W$ is a t-norm on $L^*$ and $\mathcal{S}_W$ a t-conorm on $L^*$; their
existence rules out the conjecture, implicit in most of the existing literature, that
t-norms and t-conorms on $L^*$ are necessarily characterized by a pair of fuzzy
connectives.
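Equations (8) and (9) can be tried out directly. The sketch below, with our own function names `t_w` and `s_w`, checks the boundary conditions on one sample element and exhibits why $\mathcal{T}_W$ cannot be t-representable: two argument pairs with identical second components produce different second components in the result.

```python
from fractions import Fraction as F

ONE = (F(1), F(0))   # greatest element of L*
ZERO = (F(0), F(1))  # smallest element of L*

def t_w(x, y):
    """IF Lukasiewicz t-norm, equation (8)."""
    return (max(0, x[0] + y[0] - 1),
            min(1, x[1] + 1 - y[0], y[1] + 1 - x[0]))

def s_w(x, y):
    """IF Lukasiewicz t-conorm, equation (9)."""
    return (min(1, x[0] + 1 - y[1], y[0] + 1 - x[1]),
            max(0, x[1] + y[1] - 1))

x = (F(1, 2), F(1, 5))
print(t_w(ONE, x) == x, s_w(ZERO, x) == x)  # True True

# Same second components (1/5, 1/5) in both calls, different results:
u = (F(1, 2), F(1, 5))
v = (F(1, 5), F(1, 5))
print(t_w(u, u)[1], t_w(v, v)[1])  # 7/10 1
```

If the second component of $\mathcal{T}_W(x, y)$ were of the form $S(x_2, y_2)$ for some t-conorm $S$ on $[0, 1]$, the last two values would have to coincide.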

Negators, t-norms and t-conorms on $L^*$ may be used to define generalized
versions of complementation, intersection and union of IFSs. Specifically, we may
define $co_{\mathcal{N}}(A)$, $A \cap_{\mathcal{T}} B$ and $A \cup_{\mathcal{S}} B$ by $co_{\mathcal{N}}(A)(x) = \mathcal{N}(A(x))$, $A \cap_{\mathcal{T}} B(x) =
\mathcal{T}(A(x), B(x))$, $A \cup_{\mathcal{S}} B(x) = \mathcal{S}(A(x), B(x))$, where $x \in X$.

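The final and for our purposes most important construct is that of an implicator
on $L^*$: an $(L^*)^2 \to L^*$ mapping $\mathcal{I}$ satisfying $\mathcal{I}(0_{L^*}, 0_{L^*}) = 1_{L^*}$, $\mathcal{I}(1_{L^*}, 0_{L^*}) =
0_{L^*}$, $\mathcal{I}(0_{L^*}, 1_{L^*}) = 1_{L^*}$, $\mathcal{I}(1_{L^*}, 1_{L^*}) = 1_{L^*}$. Moreover, we require $\mathcal{I}$ to be decreasing
in its first, and increasing in its second component. This definition is very
general; as in fuzzy set theory, we may distinguish implicators on $L^*$ w.r.t. their
construction. Explicitly, an S-implicator $\mathcal{I}_{\mathcal{S},\mathcal{N}}$ is defined as, for $x, y \in L^*$:

$$\mathcal{I}_{\mathcal{S},\mathcal{N}}(x, y) = \mathcal{S}(\mathcal{N}(x), y) \tag{10}$$

with $\mathcal{S}$ a t-conorm and $\mathcal{N}$ a negator on $L^*$. An R-implicator $\mathcal{I}_{\mathcal{T}}$, generated by a
t-norm $\mathcal{T}$ on $L^*$, is defined as, for $x, y \in L^*$, by

$$\mathcal{I}_{\mathcal{T}}(x, y) = \sup\{\gamma \in L^* \mid \mathcal{T}(x, \gamma) \leq_{L^*} y\} \tag{11}$$

These two classes contain most of the prominent implicators. For example,
the S-implicator of $\mathcal{S}_W$ and $\mathcal{N}_s$, equal to the R-implicator of $\mathcal{T}_W$, is given by

$$\mathcal{I}_{\mathcal{S}_W,\mathcal{N}_s}(x, y) = \mathcal{I}_{\mathcal{T}_W}(x, y) = (\min(1, y_1 + 1 - x_1, x_2 + 1 - y_2), \max(0, y_2 + x_1 - 1)) \tag{12}$$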
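The S-implicator construction (10) can be checked against the closed form (12) with a few lines of code. This is a sketch with our own helper names; it builds the implicator from the IF Łukasiewicz t-conorm and the standard negator and verifies the four boundary conditions.

```python
from fractions import Fraction as F

ONE = (F(1), F(0))
ZERO = (F(0), F(1))

def n_s(x):
    """Standard negator: swap the two components."""
    return (x[1], x[0])

def s_w(x, y):
    """IF Lukasiewicz t-conorm, equation (9)."""
    return (min(1, x[0] + 1 - y[1], y[0] + 1 - x[1]),
            max(0, x[1] + y[1] - 1))

def impl(x, y):
    """S-implicator of s_w and n_s, equation (10)."""
    return s_w(n_s(x), y)

def impl_closed(x, y):
    """Closed form, equation (12)."""
    return (min(1, y[0] + 1 - x[0], x[1] + 1 - y[1]),
            max(0, y[1] + x[0] - 1))

x = (F(3, 5), F(1, 5))
y = (F(3, 10), F(1, 2))
print(impl(x, y) == impl_closed(x, y))  # True
print(impl(ONE, ZERO) == ZERO)          # True
```

Whenever $x \leq_{L^*} y$ the result is $1_{L^*}$, mirroring the behaviour of the Łukasiewicz implicator on $[0, 1]$.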

Other than by their construction, implicators on L* may also be classified 
by the properties they satisfy. The following important theorem is based on [4] . 
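Theorem 1. A continuous implicator $\mathcal{I}$ on $L^*$ satisfies

$$(\forall x, y, z \in L^*)(\mathcal{I}(x, \inf(y, z)) = \inf(\mathcal{I}(x, y), \mathcal{I}(x, z))) \tag{13}$$

$$(\forall x \in L^*)(\mathcal{I}(1_{L^*}, x) = x) \tag{14}$$

$$(\forall x, y \in L^*)(\mathcal{I}(\mathcal{N}_s(y), \mathcal{N}_s(x)) = \mathcal{I}(x, y)) \tag{15}$$

$$(\forall x, y, z \in L^*)(\mathcal{I}(x, \mathcal{I}(y, z)) = \mathcal{I}(y, \mathcal{I}(x, z))) \tag{16}$$

$$(\forall x, y \in L^*)(x \leq_{L^*} y \iff \mathcal{I}(x, y) = 1_{L^*}) \tag{17}$$

$$(\forall x, y \in L^*)(x = 1_{L^*} \text{ and } y = 0_{L^*} \iff \mathcal{I}(x, y) = 0_{L^*}) \tag{18}$$

$$\mathcal{I}(D, D) \subseteq D \tag{19}$$

iff there exists a continuous increasing permutation⁵ $\varphi$ of $[0, 1]$ s.t., for $x, y \in L^*$,

$$\mathcal{I}(x, y) = \left(\varphi^{-1}(\min(1, 1 + \varphi(y_1) - \varphi(x_1), 1 + \varphi(1 - y_2) - \varphi(1 - x_2))),\ 1 - \varphi^{-1}(\min(1, 1 - \varphi(x_1) + \varphi(1 - y_2)))\right) \tag{20}$$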

To conclude this section, the cardinality of an IFS in $X$ was defined by Szmidt
and Kacprzyk [16] as the couple $(\min \Sigma Count(A), \max \Sigma Count(A))$, where

$$\min \Sigma Count(A) = \sum_{x \in X} \mu_A(x) \tag{21}$$

$$\max \Sigma Count(A) = \sum_{x \in X} (\mu_A(x) + \pi_A(x)) = \sum_{x \in X} (1 - \nu_A(x)) \tag{22}$$


3 Construction of IF Inclusion Measures 

In this section, we study different strategies to come up with reasonable
subsethood indicators $\mathit{Inc}$ for IFSs. A first question that needs to be answered is
what kind of a mapping $\mathit{Inc}$ should be: evidently, its inputs are IFSs in $X$, but
what kind of object should its output be? Since we have been speaking about graded
inclusion indicators, the natural option seems to be just a number in $[0, 1]$; the
following example, however, shows that this strategy can lead to anomalies.

Example 1. Let $A, B$ be IFSs in $X = \{x_1, x_2\}$, such that $A(x_1) = 1_{L^*}$,
$B(x_1) = (0, 0)$, $A(x_2) = (0, 0)$, $B(x_2) = 0_{L^*}$. Obviously, $A \not\subseteq B$. Yet, due to
the indeterminacy w.r.t. $B$ and $x_1$, and w.r.t. $A$ and $x_2$, there is no indication
that $A$ is not a subset of $B$ at all, nor can it be argued that $A$ is a subset of $B$ to
a given extent $\alpha \in [0, 1]$: in fact, it could be a subset, to a certain extent, of $B$,
but the presence of maximal indeterminacy does not allow us to cut the knot! In this
sense, forcing $\mathit{Inc}(A, B)$ to be in $[0, 1]$ is too committing, and a more natural
way to express $A$'s inclusion into $B$ is by the element $(0, 0) \in L^*$, exploiting it
to express the same kind of indeterminacy as it does for partial membership to
a set: we simply cannot tell.

® It can be verified that this is equivalent to the existence of a permutation ^ of L*, 
where ${x) = — — such that I = o {<!>, $). For this reason, 

T is also called a ^-transform of the R-implicator of the IF Lukasiewicz t-norm. 
This example suggests that $\mathit{Inc}$ be an $\mathcal{IF}(X) \times \mathcal{IF}(X) \to L^*$ mapping. It
also presents a criterion for IF inclusion measures without an analog in fuzzy
set theory, namely that $\mathit{Inc}(A, B) = (0, 0)$ when $A$ and $B$ are as in the above
test case. On the other hand, we can borrow substantially from the available
literature on fuzzy inclusion measures, as we will see shortly.

A convenient way to derive IF inclusion measures is to list a number of 
criteria for them, and then find out which operations satisfy these conditions. 
In fuzzy set theory, such an approach was taken by Sinha and Dougherty [14], 
and independently also by Kitainik [11]. Although neither linked their results 
explicitly to one of the formulas (1-3) defining crisp subsethood, subsequent 
research [7] pointed out that they implicitly invoked (1), and thus the use of an 
implicator on [0, 1], by their insistence on the heritage property (4). For now, we 
will take this property for granted; a convenient working set of criteria is then 
given as: 



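(I1) Contrapositivity: $\mathit{Inc}(A, B) = \mathit{Inc}(co B, co A)$

(I2) Distributivity: $\mathit{Inc}(A, B \cap C) = \inf(\mathit{Inc}(A, B), \mathit{Inc}(A, C))$

(I3) Symmetry: $\mathit{Inc}(A, B) = \mathit{Inc}(S(A), S(B))$

(I4) Faithfulness:
     a) $\mathit{Inc}(A, B) = 1_{L^*} \iff A \subseteq B$
     b) $\mathit{Inc}(A, B) = 0_{L^*} \iff (\exists x \in X)(A(x) = 1_{L^*}$ and $B(x) = 0_{L^*})$
     c) $A, B \in \mathcal{F}(X) \Rightarrow \mathit{Inc}(A, B) \in D$

where $A, B, C \in \mathcal{IF}(X)$ and $S$ is an $\mathcal{IF}(X) \to \mathcal{IF}(X)$ mapping defined by, for
$x \in X$, $S(A)(x) = A(s(x))$, with $s$ a permutation of $X$.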



Historically, these conditions go back to different sources: the first three
requirements were adopted from Kitainik's work on fuzzy inclusion measures [11],
while the two faithfulness conditions (I4a-b) are due to Sinha and Dougherty [14].
The heritage property is a consequence of (I4a-b). Finally, we added another
faithfulness condition, (I4c), to ensure that $\mathit{Inc}$, when applied to fuzzy
information, acts like a fuzzy inclusion measure. It can be verified that a mapping
satisfying (I1)-(I4) is decreasing in its first, and increasing in its second
component. The following theorem gives an explicit characterization. Its proof
draws its inspiration from [7], [8] and [11].

Theorem 2. An $\mathcal{IF}(X) \times \mathcal{IF}(X) \to L^*$ mapping $\mathit{Inc}$ satisfies (I1)-(I4) iff

$$\mathit{Inc}(A, B) = \inf_{x \in X} \mathcal{I}(A(x), B(x)), \tag{23}$$

with $\mathcal{I}$ an IF implicator satisfying properties (13), (15), (17), (18) and (19).

It is interesting that in order to be compliant with the test case of Example 1,
$\mathcal{I}$ should also satisfy (14). Few candidates $\mathcal{I}$ fulfill all requirements; Theorem 1



For a detailed account of the various links between Kitainik’s and Sinha and 
Dougherty’s approach, and their unification, we refer the interested reader to [7]. 
characterized all continuous mappings complying with these stringent conditions,
i.e. the $\varphi$-transforms of the R-implicator of the Lukasiewicz t-norm $\mathcal{I}_{T_W}$. The
simplest of these uses $\Phi(x) = x$ for all $x \in L^*$ and will be called $\mathit{Inc}_{T_W}$:

$$\mathit{Inc}_{T_W}(A, B) = \inf_{x \in X} \mathcal{I}_{T_W}(A(x), B(x)) \tag{24}$$

A byproduct of this result is that it forces us to reject the argument
that a graded subsethood assessment for IFSs could be reduced trivially to
assessing e.g. $\mathit{Inc}_{T_W}(\mu_A, \mu_B)$ and $\mathit{Inc}_{T_W}(\nu_B, \nu_A)$, which are both in $D$ by (I4c).

Indeed, for the test case in Example 1, both of these are equal to $0_{L^*}$, whereas
$\mathit{Inc}_{T_W}(A, B) = (0, 0)$. In other words, determining subsethood for IFSs does
not amount to a mere double application of a fuzzy inclusion measure, as some
IFS critics suggest!
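The comparison can be replayed numerically. The sketch below assumes the test case of Example 1 with $A(x_1) = 1_{L^*}$, $B(x_1) = (0,0)$, $A(x_2) = (0,0)$ and $B(x_2) = 0_{L^*}$, represents IFSs as dictionaries of pairs, and takes the infimum on $L^*$ componentwise (minimum of the first components, maximum of the second). All function names are illustrative.

```python
def i_tw(x, y):
    """Formula (12) on pairs (x1, x2) of L*."""
    return (min(1, y[0] + 1 - x[0], x[1] + 1 - y[1]), max(0, y[1] + x[0] - 1))

def inc_tw(A, B):
    """Formula (24): infimum on L* of the pointwise implications."""
    vals = [i_tw(A[x], B[x]) for x in A]
    return (min(v[0] for v in vals), max(v[1] for v in vals))

def fuzzy_inc_tw(a, b):
    """Ordinary fuzzy counterpart, using the Lukasiewicz implicator on [0, 1]."""
    return min(min(1, 1 - a[x] + b[x]) for x in a)

A = {"x1": (1, 0), "x2": (0, 0)}
B = {"x1": (0, 0), "x2": (0, 1)}

mu_A = {x: v[0] for x, v in A.items()}
nu_A = {x: v[1] for x, v in A.items()}
mu_B = {x: v[0] for x, v in B.items()}
nu_B = {x: v[1] for x, v in B.items()}

print(inc_tw(A, B))               # IF inclusion degree in L*
print(fuzzy_inc_tw(mu_A, mu_B))   # membership-based fuzzy degree
print(fuzzy_inc_tw(nu_B, nu_A))   # non-membership-based fuzzy degree
```

The two fuzzy assessments both commit to total non-inclusion, while the IF measure returns the fully indeterminate element.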

Let us focus again on the heritage condition (4) . In her paper on fuzzy subset- 
hood, Young [18] raises skepticism about it: she reasons that much of the relative 
structure of fuzzy sets, and by extension IFSs, is lost when imposing it; indeed, 
if two fuzzy sets A and B are equal everywhere, except in the point x for which 
A{x) = 1 and B{x) = 0, (4) forces Inc{A, B) = 0. One can think of very concrete 
instances in which this indeed makes no sense. Imagine for instance that we are 
to evaluate to what extent the young people in a company are also rich. Testing 
subsethood of the fuzzy set of young workers into that of rich workers should 
then be based on the relative fraction (i.e. the frequency) of good earners among 
the youngsters, and not on whether there exists or does not exist one poor, 
young employee. This observation has led researchers to consider extensions to 
definition (2) of crisp subsethood, which works well for fuzzy inclusion measures. 
Indeed, if $A$ and $B$ are fuzzy sets, then e.g. $\min \Sigma Count(A) = \max \Sigma Count(A)$;
putting $|A| = \min \Sigma Count(A)$, one can define the subsethood of $A$ into $B$ as
the ratio of $|A \cap B|$ and $|A|$ if $A \neq \emptyset$, and 0 otherwise (see e.g. [12]).
Unfortunately, there is no straightforward extension to IFS theory, since IF
cardinalities are intervals of positive real values; resorting to interval calculus
is not a viable option, either, as the following example shows.

Example 2. Given two strictly positive real intervals $[a, b]$ and $[c, d]$, interval
calculus defines their ratio as

$$\frac{[a, b]}{[c, d]} = \left[\frac{a}{d}, \frac{b}{c}\right] \tag{25}$$

Define $f$, for IFSs $A$ and $B$ in $X$, as

$$f(A, B) = \frac{[\min \Sigma Count(A \cap B), \max \Sigma Count(A \cap B)]}{[\min \Sigma Count(A), \max \Sigma Count(A)]} \tag{26}$$

Let $X = \{x\}$, $A(x) = (0.5, 0.3)$, $B(x) = (0.6, 0.2)$. Then $A \subseteq B$, but
$f(A, B) = \frac{[0.5, 0.7]}{[0.5, 0.7]} = \left[\frac{5}{7}, \frac{7}{5}\right] \neq [1, 1]$. It is also unclear how to associate
$f(A, B)$ with an element of $L^*$.
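The arithmetic of Example 2 can be reproduced with exact fractions. The sketch below assumes the usual pointwise IF intersection (minimum of memberships, maximum of non-memberships); the helper names are illustrative.

```python
from fractions import Fraction as F

def sigma_counts(A):
    """(min SigmaCount, max SigmaCount) of an IFS given as {x: (mu, nu)}."""
    return (sum(mu for mu, nu in A.values()),
            sum(1 - nu for mu, nu in A.values()))

def intersection(A, B):
    """Pointwise IF intersection: (min of memberships, max of non-memberships)."""
    return {x: (min(A[x][0], B[x][0]), max(A[x][1], B[x][1])) for x in A}

def interval_ratio(num, den):
    """Formula (25): [a, b] / [c, d] = [a / d, b / c]."""
    (a, b), (c, d) = num, den
    return (a / d, b / c)

def f(A, B):
    """Formula (26)."""
    return interval_ratio(sigma_counts(intersection(A, B)), sigma_counts(A))

A = {"x": (F(1, 2), F(3, 10))}
B = {"x": (F(3, 5), F(1, 5))}
print(f(A, B))
```

Even though the numerator and denominator intervals are identical here, the interval quotient is not the degenerate interval at 1, which is the anomaly the example points out.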
Evidently, the extension of (2) by interval calculus is problematic, and the
problem appears to be due to definition (25). If however in (25) we put $c = d$,
the result will be an interval in $[0, 1]$. This property is particularly useful when
one considers the alternative definition (3) of crisp inclusion. Indeed, using the
definition of generalized IF union and complement, we can compute

$$f_{S,\mathcal{N}}(A, B) = \frac{[\min \Sigma Count(co_{\mathcal{N}}(A) \cup_S B), \max \Sigma Count(co_{\mathcal{N}}(A) \cup_S B)]}{[\min \Sigma Count(X), \max \Sigma Count(X)]} \tag{27}$$



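Let $f_{S,\mathcal{N}}(A, B) = [c_1, c_2]$. Since $\min \Sigma Count(X) = \max \Sigma Count(X) = |X|$,
$(c_1, 1 - c_2) \in L^*$. Recalling definition (10) of an S-implicator on $L^*$, we may
therefore introduce the following class of inclusion measures $\mathit{Inc}_{S,\mathcal{N}}$:

$$\mathit{Inc}_{S,\mathcal{N}}(A, B) = \left(\frac{1}{|X|} \sum_{x \in X} (\mathcal{I}_{S,\mathcal{N}}(A(x), B(x)))_1, \frac{1}{|X|} \sum_{x \in X} (\mathcal{I}_{S,\mathcal{N}}(A(x), B(x)))_2\right) = \frac{1}{|X|} \sum_{x \in X} \mathcal{I}_{S,\mathcal{N}}(A(x), B(x)) \tag{28}$$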



The second equality introduces a convenient shorthand notation. 



An example at hand is $\mathit{Inc}_{S_W,\mathcal{N}_s}$. It satisfies all of (I1)-(I3), (I4a) and (I4c), but
not (I4b). In that sense, it has a much more lenient behaviour w.r.t. the young
and rich employee problem. In fact, it can be seen that it bears a close
relationship to $\mathit{Inc}_{T_W}$: instead of taking the infimum of all values
$\mathcal{I}_{T_W}(A(x), B(x))$, it computes their "average", which allows for a much greater
deal of compensation.
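To make the contrast concrete, the sketch below evaluates both the infimum-based and the average-based measure on a toy version of the young and rich employee problem. It uses the fact, stated in (12), that the S-implicator of $S_W$ and $\mathcal{N}_s$ coincides with the R-implicator of $T_W$. The data (three employees who are fully young and fully rich, one who is fully young and not rich at all) and all names are hypothetical.

```python
from fractions import Fraction as F

def i_tw(x, y):
    """Formula (12); equals the S-implicator of S_W and N_s."""
    return (min(1, y[0] + 1 - x[0], x[1] + 1 - y[1]), max(0, y[1] + x[0] - 1))

def inc_inf(A, B):
    """Infimum-based measure, formula (24)."""
    vals = [i_tw(A[k], B[k]) for k in A]
    return (min(v[0] for v in vals), max(v[1] for v in vals))

def inc_avg(A, B):
    """Average-based measure, formula (28)."""
    vals = [i_tw(A[k], B[k]) for k in A]
    n = len(vals)
    return (F(sum(v[0] for v in vals)) / n, F(sum(v[1] for v in vals)) / n)

yes, no = (F(1), F(0)), (F(0), F(1))
young = {"e1": yes, "e2": yes, "e3": yes, "e4": yes}
rich = {"e1": yes, "e2": yes, "e3": yes, "e4": no}

print(inc_inf(young, rich))
print(inc_avg(young, rich))
```

A single poor young employee drives the infimum-based degree to the bottom element, whereas the average-based degree still reflects that three out of four young employees are rich.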



4 Applications of IF Inclusion Measures 

4.1 Inclusion-Based Approximate Reasoning 

Roughly, approximate reasoning is concerned with the deduction of imprecise
conclusions from imprecise premises. In the context of IFS theory, an IF if-then
rule is a construct with the generic form "If $V_1$ is $A$ then $V_2$ is $B$" where
$V_1$ and $V_2$ represent an input and an output variable, respectively, and $A$ and $B$
are normalized IFSs in the universes $X$ of $V_1$ and $Y$ of $V_2$. Typically, then, the
system is presented with an observation on the input variable of the form "$V_1$ is
$A'$" with $A'$ not necessarily equalling $A$, and asked to derive a suitable IFS $B'$
such that "$V_2$ is $B'$". One way of obtaining $B'$ is by applying the Compositional
Rule of Inference [6]: the if-then rule is paraphrased by an IF relation $R$ from
$X$ to $Y$, i.e. an IFS in $X \times Y$, and $B'$ is computed by taking the $\circ_T$-composition
of $R$ and $A'$, defined by, for $y \in Y$,

$$B'(y) = R \circ_T A'(y) = \sup_{x \in X} T(A'(x), R(x, y)) \tag{29}$$

x^X 

An IFS $A$ in $X$ is called normalized if there exists at least one $x \in X$ such that
$A(x) = 1_{L^*}$.

Typically, $R(x, y) = \mathcal{I}(A(x), B(y))$ for some implicator $\mathcal{I}$ on $L^*$ or $R(x, y) =
T(A(x), B(y))$ for some t-norm $T$ on $L^*$.
This calculation is computationally costly. In [6], a procedure called
inclusion-based approximate reasoning was shown to approximate $B'(y)$ in particular
cases.

Theorem 3. If $B'(y) = \sup_{x \in X} T_W(A'(x), \mathcal{I}_{T_W}(A(x), B(y)))$, then
$\mathcal{I}_{T_W}(\mathit{Inc}_{T_W}(A', A), B(y)) \ge_{L^*} B'(y)$, and $\mathit{Inc}_{T_W}(B', B) \ge_{L^*} \mathit{Inc}_{T_W}(A', A)$.
Additionally, if the range of $B$ is $L^*$, then $\mathit{Inc}_{T_W}(B', B) = \mathit{Inc}_{T_W}(A', A)$.

The approximation is much easier to calculate, since it bypasses the expensive
supremum operation. Some promising examples (so far only in fuzzy
set theory) have demonstrated the practical use of inclusion-based approximate
reasoning [5].
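The first inequality of Theorem 3 can be checked on a small example. The sketch below is written under two assumptions that are not spelled out in this section: the IF Lukasiewicz t-norm is taken to be $T_W(x, y) = (\max(0, x_1 + y_1 - 1), \min(1, x_2 + 1 - y_1, y_2 + 1 - x_1))$, and suprema and infima on $L^*$ are taken componentwise. The rule, the observation and all names are hypothetical.

```python
from fractions import Fraction as F

def t_w(x, y):
    # Assumed form of the IF Lukasiewicz t-norm T_W on L*.
    return (max(0, x[0] + y[0] - 1), min(1, x[1] + 1 - y[0], y[1] + 1 - x[0]))

def i_tw(x, y):
    # Formula (12), the R-implicator of T_W.
    return (min(1, y[0] + 1 - x[0], x[1] + 1 - y[1]), max(0, y[1] + x[0] - 1))

def leq(x, y):
    # Order on L*: x <= y iff x1 <= y1 and x2 >= y2.
    return x[0] <= y[0] and x[1] >= y[1]

def sup(values):
    values = list(values)
    return (max(v[0] for v in values), min(v[1] for v in values))

def inf(values):
    values = list(values)
    return (min(v[0] for v in values), max(v[1] for v in values))

def inc_tw(A, B):
    return inf(i_tw(A[x], B[x]) for x in A)

def cri(A_obs, A, B, y):
    # Compositional Rule of Inference, formula (29), with R(x, y) = I_TW(A(x), B(y)).
    return sup(t_w(A_obs[x], i_tw(A[x], B[y])) for x in A)

def approx(A_obs, A, B, y):
    # Inclusion-based upper bound of Theorem 3.
    return i_tw(inc_tw(A_obs, A), B[y])

A = {"x1": (F(1), F(0)), "x2": (F(1, 2), F(1, 4))}
A_obs = {"x1": (F(4, 5), F(1, 10)), "x2": (F(3, 5), F(1, 5))}
B = {"y1": (F(1), F(0)), "y2": (F(3, 10), F(1, 2))}

for y in B:
    exact, bound = cri(A_obs, A, B, y), approx(A_obs, A, B, y)
    print(y, exact, bound, leq(exact, bound))
```

The bound needs one pass over $X$ to obtain the inclusion degree, after which each output value costs a single implicator evaluation instead of a supremum over $X$.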

Note also that the theorem uses $\mathit{Inc}_{T_W}$, which satisfies the controversial
property (I4b). Another setting in which this measure appears is in the calculation
of lower approximations in IF rough set theory as developed in [3].

4.2 Entropy of IFSs 

In fuzzy set theory, a common task is to determine the amount of fuzziness in 
a given fuzzy set (see e.g. [12,18]). A measure of fuzziness, also called entropy 
measure, is taken to express the extent to which a crisp distinction between the 
elements belonging and not belonging to a fuzzy set is lacking, i.e. the extent to 
which all the elements’ membership degrees are close to 0.5. In IFS theory, first 
steps were taken to define a measure of entropy by Szmidt and Kacprzyk in [16]. 

To Szmidt and Kacprzyk, an IF entropy measure on $X$ is an $\mathcal{IF}(X) \to [0, 1]$
mapping $E$ satisfying

1. $E(A) = 0 \iff A \in \mathcal{P}(X)$

2. $E(A) = 1 \iff \mu_A = \nu_A$

3. $(\forall x \in X)(\mu_A(x) \le \mu_B(x) \le \nu_B(x) \le \nu_A(x)$ or $\mu_A(x) \ge \mu_B(x) \ge \nu_B(x) \ge
   \nu_A(x)) \Rightarrow E(A) \le E(B)$

4. $E(co(A)) = E(A)$.

These conditions are faithful extensions to those imposed on fuzzy entropy
measures. Still, we wish to adjust these requirements in the following respects.
First, in a similar vein as for IF inclusion measures, it can be argued that entropy
of IFSs cannot be reasonably captured by just one number and is better expressed
by elements of $L^*$. For instance, if $(\forall x \in X)(A(x) = (0, 0))$, then $E(A)$ should
be equal to $(0, 0)$, since no information is available on the fuzziness of $A$. This
also explains our feeling that requirements 2. and 3. are too strong, and should
be replaced by $E(A) = 1_{L^*} \iff (\forall x \in X)(\mu_A(x) = \nu_A(x) = 0.5)$ and
$(\forall x \in X)(\mu_A(x) \le \mu_B(x) \le 0.5 \le \nu_B(x) \le \nu_A(x)$ or $\mu_A(x) \ge \mu_B(x) \ge 0.5 \ge \nu_B(x) \ge
\nu_A(x)) \Rightarrow E(A) \le_{L^*} E(B)$.

In fact, we feel that IF entropy should reflect the range of situations that
could occur if the indeterminacy in $A$ were to disappear, that is: if $\pi_A(x)$ were
distributed between $\mu_A(x)$ and $\nu_A(x)$. This idea is illustrated by an example.

Replacing $\nu_A$ by $co(\mu_A)$ in the above definition, we can obtain them.
Example 3. Let $A$ be the IFS in $X = \{x_1, x_2\}$ defined by $A(x_1) = (0.2, 0.3)$,
$A(x_2) = (0.1, 0.1)$. Then $\pi_A(x_1) = 0.5$ and $\pi_A(x_2) = 0.8$. The "least fuzzy" fuzzy
set obtainable by distributing $\pi_A$ among $\mu_A$ and $\nu_A$ is $A'$ defined by $A'(x_1) =
(0.2, 0.8)$, $A'(x_2) = (0.9, 0.1)$. The "most fuzzy" fuzzy set derived in this way is
$A''$ defined by $A''(x_1) = (0.5, 0.5)$, $A''(x_2) = (0.5, 0.5)$. Applying an arbitrary
fuzzy entropy measure $E$ on $X$ to $A'$ and $A''$, we have $E(A') \le E(A'')$. The
"real" fuzziness of $A$ is therefore somewhere in the interval $[E(A'), E(A'')]$, and
hence could be represented equivalently by $(E(A'), 1 - E(A'')) \in L^*$.

Kosko [12] was the first to link the entropy of a fuzzy set $A$ to the degree to
which $A \cup co(A)$ is included into $A \cap co(A)$, using specific definitions for $E$ and
$\mathit{Inc}$. It is therefore interesting to investigate whether IF entropy measures like
the one suggested in the above example can be obtained at all using IF inclusion
measures. The following theorem is a very nice affirmation of this conjecture,
showing at once the use of frequency-based IF inclusion measures and of
t-representable connectives on $L^*$.

Theorem 4. Let $\mathcal{E}(A)$, for $A \in \mathcal{IF}(X)$, be defined as

$$\mathcal{E}(A) = \mathit{Inc}_{S,\mathcal{N}_s}(A \cup co(A), A \cap co(A)) \tag{30}$$

where $S(x, y) = (\min(1, x_1 + y_1), \max(0, x_2 + y_2 - 1))$. Then $\mathcal{E}$ satisfies the
modified conditions of Szmidt and Kacprzyk, and

$$\mathcal{E}(A) = (E(A'), 1 - E(A'')) \tag{31}$$

where $A'$ and $A''$ are fuzzy sets in $X$ such that for $x \in X$, $A'(x) = \max(1 -
\mu_A(x), 1 - \nu_A(x))$, and $A''(x) = \max(0.5, \mu_A(x), \nu_A(x))$, and $E$ is a fuzzy entropy
measure, defined for a fuzzy set $B$ in $X$ by

$$E(B) = 1 - \frac{1}{|X|} \sum_{x \in X} |2B(x) - 1| \tag{32}$$

It can be verified that defining e.g. $\mathcal{E}(A) = \mathit{Inc}_{T_W}(A \cup co(A), A \cap co(A))$,
a similar decomposition is not possible.
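The decomposition (31) can be checked on the IFS of Example 3. The sketch below assumes the standard negator $\mathcal{N}_s(x_1, x_2) = (x_2, x_1)$, pointwise IF union and intersection, the average-based inclusion measure (28), and the fuzzy entropy $E(B) = 1 - \frac{1}{|X|}\sum_x |2B(x) - 1|$; the last formula is the form that makes (31) hold term by term and should be read as an assumption here. All names are illustrative.

```python
from fractions import Fraction as F

def s_w(x, y):
    """The t-conorm S of Theorem 4."""
    return (min(1, x[0] + y[0]), max(0, x[1] + y[1] - 1))

def negate(x):
    """Standard negator N_s on L* (assumed): swap the two components."""
    return (x[1], x[0])

def inc_avg(A, B):
    """Formula (28) with the S-implicator I(x, y) = S(N_s(x), y)."""
    vals = [s_w(negate(A[k]), B[k]) for k in A]
    n = len(vals)
    return (F(sum(v[0] for v in vals)) / n, F(sum(v[1] for v in vals)) / n)

def if_entropy(A):
    """Formula (30): inclusion of A union co(A) into A intersection co(A)."""
    union = {k: (max(mu, nu), min(mu, nu)) for k, (mu, nu) in A.items()}
    inter = {k: (min(mu, nu), max(mu, nu)) for k, (mu, nu) in A.items()}
    return inc_avg(union, inter)

def fuzzy_entropy(B):
    """Assumed fuzzy entropy: 1 minus the average of |2 B(x) - 1|."""
    return 1 - sum(abs(2 * b - 1) for b in B.values()) / len(B)

A = {"x1": (F(1, 5), F(3, 10)), "x2": (F(1, 10), F(1, 10))}
least = {k: max(1 - mu, 1 - nu) for k, (mu, nu) in A.items()}        # A'
most = {k: max(F(1, 2), mu, nu) for k, (mu, nu) in A.items()}        # A''

print(if_entropy(A))
print((fuzzy_entropy(least), 1 - fuzzy_entropy(most)))
```

Both lines describe the same element of $L^*$: the first component is the entropy of the least fuzzy completion of $A$, and the second is one minus the entropy of the most fuzzy completion.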

5 Conclusion 

This paper has studied various approaches to the definition of intuitionistic fuzzy 
inclusion measures. We have attempted to reconcile various requirements posed 
by fuzzy set theory with the specific indeterministic nature of IFSs. This has 
resulted in two essentially different types of IF inclusion measures, although 
both are dependent on an implicator on the evaluation set L* . Future work 
should focus on the particular meaning of the individual degrees of the result. 
The example on IF entropy has already shown that, under specific conditions on
the operations involved, a very attractive interpretation can be given to the
result.







Acknowledgements. The authors would like to thank the anonymous referees 
for their critical analysis of the paper. Chris Cornelis would like to thank the 
Fund for Scientific Research-Flanders for funding the research elaborated on in 
this paper. 



References 

1. Atanassov, K.T.: Intuitionistic Fuzzy Sets. Physica-Verlag, Heidelberg, New York
(1999)

2. Bandler, W., Kohout, L.: Fuzzy power sets and fuzzy implication operators. Fuzzy
Sets and Systems, 4 (1980) 13-30

3. Cornelis, C., De Cock, M., Kerre, E.: Intuitionistic fuzzy rough sets: on the
crossroads of imperfect knowledge. Accepted to: Expert Systems (2003)

4. Cornelis, C., Deschrijver, G., Kerre, E.E.: Implication in intuitionistic and
interval-valued fuzzy set theory: construction, classification, application.
Submitted to: International Journal of Approximate Reasoning (2002)

5. Cornelis, C., Kerre, E.E.: A Fuzzy Inference Methodology Based on the
Fuzzification of Set Inclusion. Recent Advances in Intelligent Paradigms and
Applications (A. Abraham, L. Jain, J. Kacprzyk, eds.), Physica-Verlag (2002) 71-89

6. Cornelis, C., Kerre, E.E.: On the structure and interpretation of an intuitionistic
fuzzy expert system. In: Proceedings of EUROFUSE 2002 (B. De Baets, J. Fodor,
G. Pasi, eds.) (2002) 173-178

7. Cornelis, C., Van Der Donck, C., Kerre, E.E.: Sinha-Dougherty approach to the
fuzzification of set inclusion revisited. Fuzzy Sets and Systems, 134(2) (2003)
283-295

8. Fodor, J., Yager, R.: Fuzzy set theoretic operators and quantifiers. Fundamentals
of Fuzzy Sets (D. Dubois, H. Prade, eds.), Kluwer, Boston, Mass. (2000) 125-193

9. Fortemps, P., Slowinski, R.: A graded quadrivalent logic for ordinal preference
modelling: Loyola-like approach. Fuzzy Optimization and Decision Making, 1 (2002)
93-111

10. Goguen, J.: L-fuzzy Sets. J. Math. Anal. Appl., 18 (1967) 145-174

11. Kitainik, L.: Fuzzy decision procedures with binary relations. Kluwer, Dordrecht,
The Netherlands (1993)

12. Kosko, B.: Fuzzy entropy and conditioning. Kluwer, Dordrecht, The Netherlands
(1993)

13. Mendel, J.M.: Uncertain rule-based fuzzy logic systems. Prentice Hall PTR, Upper
Saddle River, New Jersey (2001)

14. Sinha, D., Dougherty, E.R.: Fuzzification of set inclusion: theory and applications.
Fuzzy Sets and Systems, 55(1) (1993) 15-42

15. Stickel, M.E.: Fuzzy four-valued logic for inconsistency and uncertainty.
Proceedings of the Eighth International Symposium on Multiple-Valued Logic (1978)

16. Szmidt, E., Kacprzyk, J.: Entropy for intuitionistic fuzzy sets. Fuzzy Sets and
Systems, 118(3) (2001) 467-477

17. Takeuti, G., Titani, S.: Intuitionistic fuzzy logic and intuitionistic fuzzy set theory.
Journal of Symbolic Logic, 49(3) (1984) 851-866

18. Young, V.: Fuzzy subsethood. Fuzzy Sets and Systems, 77(3) (1996) 371-384




A Random Set Model for Fuzzy Labels 



Jonathan Lawry (1) and Jordi Recasens (2)

(1) Department of Engineering Mathematics
University of Bristol
Bristol, UK
j.lawry@bris.ac.uk

(2) Seccio Matematiques i Informatica - ETSAV
Univ. Politecnica de Catalunya
Pere Serra 1-15. Sant Cugat del Valles, Barcelona, Spain
recasens@ea.upc.es



Abstract. A random set semantics for fuzzy labels is proposed in which 
we model the vagueness of fuzzy concepts in terms of their level of ap- 
propriateness as descriptions for values. This random set model is then 
shown to be characterised by a certain axiom system for appropriateness 
measures. It is then shown how some t-norms can generate appropriate- 
ness measures and an attempt is made to identify a family of t-norms 
that can be used consistently for this purpose. The calculus that is in- 
troduced is functional but not truth-functional. 



1 Introduction 

The use of fuzzy labels to describe system variables and parameters has proved 
to be a highly effective modelling tool applied in a wide range of applications. 
Despite this there remain a number of unresolved fundamental issues associated 
with fuzzy methods in general and truth-functional fuzzy logic in particular. 
Central to these is the problem that fuzzy logic lacks an agreed operational 
semantics providing a clear interpretation of membership functions and justifying 
the truth-functionality assumption. Indeed, it is interesting to observe that many
of the most controversial aspects of fuzzy logic are a direct consequence of
truth-functionality, including the failure to satisfy the law of excluded middle and
the fact that (classically) equivalent expressions may have different membership 
degrees (see Elkan [2] and associated replies for an interesting discussion of these 
two properties). Intuitively, however, we expect the meaning of compound fuzzy 
expressions to be captured entirely by the meaning of the basic fuzzy labels 
from which they are generated. This suggests that some form of functionality 
should be a feature of any calculus for fuzzy labels. In addition, there are, of 
course, good practical arguments for functionality in terms of efficiency of both 
representation and reasoning. We claim, however, that these requirements can 
be met by a type of functionality much weaker than truth-functionality. 

In the sequel we propose a model for fuzzy labels based on random sets. This
differs from previous random set interpretations, such as those proposed by
Goodman and Nguyen (see [4], [5]), in that the random sets are defined on subsets of
labels rather than subsets of parameter values. Within this framework we define
a measure of appropriateness of a label to a value and show how this can be 
extended to compound logical expressions. A type of functionality is then in- 
troduced by means of a mass selection function. This random set model is then 
shown to be characterised by a certain axiom system for appropriateness mea- 
sures. We then show how some t-norms can generate appropriateness measures 
and we attempt to identify a family of t-norms that can be used consistently for 
this purpose. 



2 Label Semantics 

For a variable $x$ into a domain of discourse $\Omega$ we identify a finite set of words
$LA = \{L_1, \ldots, L_n\}$ with which to label the values of $x$. Then for a specific value
$x \in \Omega$ an individual $I$ identifies a subset of $LA$, denoted $\mathcal{D}_x^I$ to stand for the
description of $x$ given by $I$, as the set of words with which it is appropriate to
label $x$. Within this framework then, an expression such as 'the diastolic blood
pressure is high', as asserted by $I$, is interpreted to mean $high \in \mathcal{D}_{bp}^I$ where $bp$
denotes the value of the variable blood pressure. If we allow $I$ to vary across a
population of individuals $V$ then we naturally obtain a random set $\mathcal{D}_x$ from $V$
into the power set of $LA$ where $\mathcal{D}_x(I) = \mathcal{D}_x^I$. A probability distribution (or mass
assignment) associated with this random set can be defined and is dependent on
the prior distribution over the population $V$. We can view the random set $\mathcal{D}_x$
as a description of the variable $x$ in terms of the labels in $LA$.

Definition 1. (Label Description) For $x \in \Omega$ the label description of $x$ is a
random set from $V$ into the power set of $LA$, denoted $\mathcal{D}_x$, with associated
distribution $m_x$, given by

$$\forall S \subseteq LA \quad m_x(S) = P_V(\{I \in V : \mathcal{D}_x^I = S\})$$

where $P_V$ is the underlying distribution on $V$.

Another high level measure associated with $m_x$ is the following quantification
of the degree of appropriateness of a particular word $L \in LA$ as a label of $x$.

Definition 2. (Appropriateness Degrees)

$$\forall x \in \Omega, \forall L \in LA \quad \mu_L(x) = \sum_{S \subseteq LA : L \in S} m_x(S)$$
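Definition 2 amounts to summing the mass of every label set that contains the label in question. A minimal sketch, assuming a mass assignment is stored as a dictionary from frozensets of label names to masses (the labels and masses below are hypothetical):

```python
from fractions import Fraction as F

def appropriateness_degree(mass, label):
    """mu_L(x): total mass of the label sets S that contain L."""
    return sum(m for S, m in mass.items() if label in S)

# Hypothetical mass assignment m_x for one fixed value x.
m_x = {
    frozenset({"low", "medium"}): F(3, 10),
    frozenset({"medium"}): F(1, 2),
    frozenset({"medium", "high"}): F(1, 5),
}

for label in ("low", "medium", "high"):
    print(label, appropriateness_degree(m_x, label))
```

Note that the degrees need not sum to one across labels, since a single label set can contribute to several of them.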

We now extend the notion of appropriateness degrees from labels to com- 
pound descriptions generated as logical combinations of labels. 

Definition 3. Label Expressions

The set of label expressions of $LA$, $LE$, is defined recursively as follows:

(i) $L_i \in LE$ for $i = 1, \ldots, n$

(ii) If $\theta, \varphi \in LE$ then $\neg\theta, \theta \wedge \varphi, \theta \vee \varphi, \theta \to \varphi \in LE$
In the context of this assertion-based framework we interpret the main logical
connectives in the following manner: $L_1 \wedge L_2$ means that both $L_1$ and $L_2$ are
appropriate labels, $L_1 \vee L_2$ means that either $L_1$ or $L_2$ is an appropriate label and
$\neg L$ means that $L$ is not an appropriate label. More generally, a label expression
$\theta$ identifies a set of possible label sets $\lambda(\theta)$ as follows:

Definition 4. For $L \in LA$, $\lambda(L) = \{S \subseteq LA : L \in S\}$ and for label expressions
$\theta$ and $\varphi$

(i) $\forall L_i \in LA \quad \lambda(L_i) = \{S \subseteq LA \mid L_i \in S\}$

(ii) $\lambda(\theta \wedge \varphi) = \lambda(\theta) \cap \lambda(\varphi)$

(iii) $\lambda(\theta \vee \varphi) = \lambda(\theta) \cup \lambda(\varphi)$

(iv) $\lambda(\neg\theta) = \overline{\lambda(\theta)}$

(v) $\lambda(\theta \to \varphi) = \lambda(\neg\theta) \cup \lambda(\varphi)$

Intuitively, $\lambda(\theta)$ corresponds to those subsets of $LA$ identified as being possible
values of $\mathcal{D}_x$ by expression $\theta$. In this sense the imprecise linguistic restriction '$x$
is $\theta$' on $x$ corresponds to the strict constraint $\mathcal{D}_x \in \lambda(\theta)$ on $\mathcal{D}_x$. Hence, we can
view label descriptions as an alternative to linguistic variables [9] as a means of
encoding linguistic constraints.

The notion of appropriateness degree given above can now be extended so
that it applies to compound label expressions. The idea here is that $\mu_\theta(x)$
quantifies the degree to which expression $\theta$ is appropriate to describe $x$.

Definition 5. Compound Appropriateness Degrees

For $\theta$ a label expression and $x \in \Omega$ the appropriateness of $\theta$ to $x$ is given by:

$$\mu_\theta(x) = \sum_{S \in \lambda(\theta)} m_x(S)$$
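Definitions 4 and 5 translate directly into a recursive evaluation over the power set of $LA$. The sketch below assumes label expressions are encoded as nested tuples such as `("and", ("label", "medium"), ("not", ("label", "high")))`; this encoding, the function names and the sample mass assignment are illustrative choices, not part of the model itself.

```python
from fractions import Fraction as F
from itertools import combinations

def powerset(labels):
    """All subsets of the label set, as frozensets."""
    return {frozenset(c) for r in range(len(labels) + 1)
            for c in combinations(labels, r)}

def lam(expr, labels):
    """lambda(theta) of Definition 4, computed recursively."""
    op = expr[0]
    if op == "label":
        return {S for S in powerset(labels) if expr[1] in S}
    if op == "not":
        return powerset(labels) - lam(expr[1], labels)
    if op == "and":
        return lam(expr[1], labels) & lam(expr[2], labels)
    if op == "or":
        return lam(expr[1], labels) | lam(expr[2], labels)
    if op == "implies":
        return (powerset(labels) - lam(expr[1], labels)) | lam(expr[2], labels)
    raise ValueError("unknown connective: %r" % (op,))

def appropriateness(expr, mass, labels):
    """mu_theta(x) of Definition 5: total mass of the sets in lambda(theta)."""
    return sum(mass.get(S, 0) for S in lam(expr, labels))

LA = ["low", "medium", "high"]
m_x = {
    frozenset({"low", "medium"}): F(3, 10),
    frozenset({"medium"}): F(1, 2),
    frozenset({"medium", "high"}): F(1, 5),
}

low, medium, high = ("label", "low"), ("label", "medium"), ("label", "high")
print(appropriateness(("and", medium, ("not", high)), m_x, LA))
print(appropriateness(("or", low, ("not", low)), m_x, LA))
print(appropriateness(("or", low, high), m_x, LA))
```

Enumerating the power set is exponential in the number of labels, so this direct evaluation is only practical for small label sets.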

3 Mass Assignments and Appropriateness Degrees 

In general, for $LA = \{L_1, \ldots, L_n\}$ the values $\mu_{L_1}(x), \ldots, \mu_{L_n}(x)$ determine
an infinite set of possible mass assignments for $m_x$ satisfying the constraints
$\sum_{S \subseteq LA : L_i \in S} m_x(S) = \mu_{L_i}(x)$ for $i = 1, \ldots, n$. In terms of random set theory
this means that the appropriateness values correspond to the fixed point
coverage of the random set $\mathcal{D}_x$ [4]. Now clearly this property will be problematic
for the practical application of a label semantics based calculus. To determine
$m_x$ directly we must have total knowledge of $V$ and $P_V$, which is unlikely to be
the case, and even if such information is available then general inference on
appropriateness degrees would require the storage of $2^n - 1$ pieces of information.
Clearly, this is not feasible even for moderately large values of $n$. However, we
now argue that there is a case for assuming a functional relationship between
$\mu_{L_1}(x), \ldots, \mu_{L_n}(x)$ and $m_x$.

Given that $\mathcal{D}_x$ encodes the use of labels across a population of individuals
who are able to communicate and use these labels to convey information then
we would expect the variation of $\mathcal{D}_x$ between individuals to be strictly limited.
This in turn suggests that we should only consider candidates for $m_x$ with a
somewhat restricted structure. Indeed, if the nature of the variation of $\mathcal{D}_x$ is
sufficiently restricted we might expect that a unique solution to the above
constraints would be identified. This would mean a functional relationship between
$\mu_{L_1}(x), \ldots, \mu_{L_n}(x)$ and $m_x$. Such a functional relationship is formalised by the
following notion of a mass selection function:

Definition 6. Mass Selection Function

A mass selection function is a function $\Delta$ from $[0, 1]^n$ into the set of all
possible mass assignments on $2^{LA}$, satisfying

$$\sum_{S : L_i \in S} \Delta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))(S) = \mu_{L_i}(x) \text{ for } i = 1, \ldots, n$$

In effect then a mass selection function provides sufficient information
regarding the nature of the variation of $\mathcal{D}_x$ to uniquely determine the mass assignment
$m_x$ from the values of the appropriateness degrees on $LA$. Furthermore, since the
value of $\mu_\theta(x)$ for any expression $\theta$ can be evaluated directly from $m_x$ then given
an appropriate mass selection function we need no longer have any knowledge
of the underlying population $V$ but rather we need only define appropriateness
degrees for $L \in LA$ corresponding to the fuzzy definition of each label.

Example 1. Examples of Mass Selection Functions

The Consonant Mass Selection Function

Given appropriateness degrees $\mu_{L_1}(x), \ldots, \mu_{L_n}(x)$ ordered such that $\mu_{L_i}(x) \ge
\mu_{L_{i+1}}(x)$ for $i = 1, \ldots, n - 1$ then the consonant mass selection function identifies
the mass assignment,

$\{L_1, \ldots, L_n\} : \mu_{L_n}(x)$, $\{L_1, \ldots, L_i\} : \mu_{L_i}(x) - \mu_{L_{i+1}}(x)$ for $i = 1, \ldots, n - 1$ and
$\emptyset : 1 - \mu_{L_1}(x)$

The Independent Mass Selection Function

Given appropriateness degrees $\mu_{L_1}(x), \ldots, \mu_{L_n}(x)$ then the independent mass
selection function identifies the mass assignment:

$$\forall S \subseteq LA \quad m_x(S) = \prod_{L \in S} \mu_L(x) \times \prod_{L \notin S} (1 - \mu_L(x))$$
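Both mass selection functions of Example 1 are easy to implement and to check against the defining constraint of Definition 6. The sketch below assumes appropriateness degrees are supplied as a dictionary from label names to values in $[0, 1]$; the function names and the sample degrees are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

def consonant(mu):
    """Consonant mass selection: nested label sets ordered by decreasing degree."""
    ordered = sorted(mu, key=lambda L: mu[L], reverse=True)
    mass = {frozenset(): 1 - mu[ordered[0]]}
    for i, L in enumerate(ordered):
        nxt = mu[ordered[i + 1]] if i + 1 < len(ordered) else 0
        S = frozenset(ordered[: i + 1])
        mass[S] = mu[L] - nxt
    return {S: m for S, m in mass.items() if m != 0}

def independent(mu):
    """Independent mass selection: product of mu_L (L in S) and 1 - mu_L (L not in S)."""
    labels = list(mu)
    mass = {}
    for r in range(len(labels) + 1):
        for c in combinations(labels, r):
            S = frozenset(c)
            m = 1
            for L in labels:
                m *= mu[L] if L in S else 1 - mu[L]
            mass[S] = m
    return mass

def degree(mass, label):
    """Recover mu_L from a mass assignment (Definition 2)."""
    return sum(m for S, m in mass.items() if label in S)

mu = {"low": F(3, 10), "medium": F(1), "high": F(1, 5)}
for select in (consonant, independent):
    m_x = select(mu)
    print(select.__name__, sum(m_x.values()),
          {L: degree(m_x, L) for L in mu})
```

Both selections reproduce the same label degrees but distribute the mass very differently: the consonant one concentrates it on a chain of nested sets, whereas the independent one spreads it over all subsets.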

4 A Calculus for Appropriateness Measures 

In this section we propose an axiomatic foundation for measures of appropriate- 
ness and investigate its relationship to the random set model proposed above. 
In particular, we investigate the role of t-norms as generators of measures of 
appropriateness. The following definition proposes a system of axioms for
appropriateness measures on $LE \times \Omega$:

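Definition 7. Appropriateness Measures

An appropriateness measure on $LE \times \Omega$ is a function $\mu : LE \times \Omega \to [0, 1]$ such
that $\forall x \in \Omega$, $\theta \in LE$, $\mu_\theta(x)$ quantifies the appropriateness of label expression $\theta$
as a description of value $x$ and satisfies:

AM1 $\forall \theta \in LE$ if $\models \theta$ then $\forall x \in \Omega$ $\mu_\theta(x) = 1$

AM2 $\forall \theta, \varphi \in LE$ if $\theta \equiv \varphi$ then $\forall x \in \Omega$ $\mu_\theta(x) = \mu_\varphi(x)$

AM3 $\forall \theta, \varphi \in LE$ if $\models \neg(\theta \wedge \varphi)$ then $\forall x \in \Omega$ $\mu_{\theta \vee \varphi}(x) = \mu_\theta(x) + \mu_\varphi(x)$

AM4 $\forall \theta \in LE$ there exists a function $f_\theta : [0, 1]^n \to [0, 1]$ such that
$\forall x \in \Omega$ $\mu_\theta(x) = f_\theta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))$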

A possible justification for these axioms is as follows: AM1 can be justified on the
basis that a tautology can be applied as a description to any value. AM2 states
that the level of appropriateness is invariant under logical equivalence. In other
words, if two expressions have the same meaning then they should be equally
appropriate to describe any value. We can argue for AM3 on the basis that if
two expressions can never both be appropriate descriptions then the
appropriateness of their disjunction should be the sum of their respective appropriateness
degrees. Taken together AM1-AM3 ensure that for a fixed $x$ an appropriateness
measure defines a probability measure over $LE$. AM4 captures the intuition
that the meaning of compound expressions should be completely determined by
the meaning of the fuzzy labels in $LA$ as represented by their appropriateness
degrees. We now show that this system of axioms can be characterised by the
random set model introduced in Sects. 2 and 3 taken in conjunction with a mass
selection function.

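Definition 8. Logical Atoms

(i) Let $ATT$ denote the set of logical atoms of $LE$ (i.e. all expressions of the
form $\alpha = \bigwedge_{i=1}^{n} \pm L_i$)

(ii) Let $ATT_\theta = \{\alpha \in ATT \mid \alpha \models \theta\}$

(iii) $\forall S \subseteq LA$, $\alpha_S = \left(\bigwedge_{L_i \in S} L_i\right) \wedge \left(\bigwedge_{L_i \notin S} \neg L_i\right)$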

Lemma 1. Let $Val$ denote the set of valuations on $LA$ and let $\tau : Val \to 2^{LA}$
such that $\forall v \in Val$ $\tau(v) = \{L_i : v(L_i) = true\}$ then

$$\forall \theta \in LE \quad \lambda(\theta) = \{\tau(v) : v(\theta) = true\}$$



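Lemma 2.

$$\forall \theta \in LE \quad ATT_\theta = \{\alpha_S : S \in \lambda(\theta)\}$$

Proof

($\subseteq$)

Let $v_\alpha \in Val$ be defined such that $\forall L \in LA$ $v_\alpha(L) = true$ if and only if $\alpha \models L$.
Suppose $\alpha \in ATT_\theta$ then $v_\alpha(\theta) = true \Rightarrow \tau(v_\alpha) \in \{\tau(v) : v(\theta) = true\}$
$\Rightarrow \tau(v_\alpha) \in \lambda(\theta)$ by Lemma 1.

Now letting $S = \tau(v_\alpha)$ then $\alpha = \alpha_S$ and therefore $\alpha \in \{\alpha_S : S \in \lambda(\theta)\}$.

($\supseteq$)

Suppose $\alpha = \alpha_S$ for some $S \in \lambda(\theta)$ then

$\exists v \in Val : v(\theta) = true$ and $S = \tau(v)$ by Lemma 1

$\Rightarrow \alpha \in ATT_\theta$ since $v(\theta) = true$ if and only if $\alpha_{\tau(v)} \in ATT_\theta$.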
Lemma 3. $\forall \theta, \varphi \in LE$, $\theta \equiv \varphi$ if and only if $\lambda(\theta) = \lambda(\varphi)$

Lemma 4. If $\theta \in LE$ is a tautology then $\lambda(\theta) = 2^{LA}$

Proofs of Lemmas 1, 3 and 4 are given in [6].

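Theorem 1. Characterization Theorem

$\mu$ is an appropriateness measure on $LE \times \Omega$ iff $\forall x \in \Omega$ there exists a mass
assignment $m_x$ on $2^{LA}$ such that

$$m_x = \Delta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))$$

for some mass selection function $\Delta$ and

$$\forall \theta \in LE \quad \mu_\theta(x) = \sum_{S \in \lambda(\theta)} m_x(S)$$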



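Proof

($\Leftarrow$)

By Lemma 4 if $\models \theta$ then $\mu_\theta(x) = \sum_{S \subseteq LA} m_x(S) = 1$ and hence AM1 holds.
$\forall \theta, \varphi \in LE : \theta \equiv \varphi$ we have by Lemma 3 that

$$\forall x \in \Omega \quad \mu_\theta(x) = \sum_{S \in \lambda(\theta)} m_x(S) = \sum_{S \in \lambda(\varphi)} m_x(S) = \mu_\varphi(x)$$

Hence, AM2 holds.

If $\models \neg(\theta \wedge \varphi)$ then by Lemma 4 and Definition 4 $\lambda(\theta \wedge \varphi) = \emptyset \Rightarrow \lambda(\theta) \cap \lambda(\varphi) = \emptyset$,
hence $\forall x \in \Omega$

$$\mu_{\theta \vee \varphi}(x) = \sum_{S \in \lambda(\theta) \cup \lambda(\varphi)} m_x(S) = \sum_{S \in \lambda(\theta)} m_x(S) + \sum_{S \in \lambda(\varphi)} m_x(S) = \mu_\theta(x) + \mu_\varphi(x)$$

Hence AM3 holds.

$\forall \theta \in LE$ $\mu_\theta(x) = f_\theta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))$ where

$$f_\theta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x)) = \sum_{S \in \lambda(\theta)} \Delta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))(S)$$

Hence AM4 holds.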

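($\Rightarrow$)

By the disjunctive normal form theorem for propositional logic it follows that
$\forall \theta \in LE$ $\theta \equiv \bigvee_{\alpha \in ATT_\theta} \alpha$; therefore by AM2 and AM3 we have that:

$$\forall x \in \Omega \quad \mu_\theta(x) = \mu_{\bigvee_{\alpha \in ATT_\theta} \alpha}(x) = \sum_{\alpha \in ATT_\theta} \mu_\alpha(x)$$

Now $\forall x \in \Omega$, $\forall S \subseteq LA$ let

$$m_x(S) = \mu_{\alpha_S}(x)$$

then clearly $m_x$ defines a mass assignment on $2^{LA}$. Also $m_x$ can be determined
uniquely from $\mu_{L_1}(x), \ldots, \mu_{L_n}(x)$ according to the mass selection function

$$\Delta(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))(S) = f_{\alpha_S}(\mu_{L_1}(x), \ldots, \mu_{L_n}(x))$$

where $f_{\alpha_S} : [0, 1]^n \to [0, 1]$ is the function identified for $\alpha_S$ by AM4.
Finally, by Lemma 2

$$\forall \theta \in LE \quad \mu_\theta(x) = \sum_{\alpha \in ATT_\theta} \mu_\alpha(x) = \sum_{S \in \lambda(\theta)} \mu_{\alpha_S}(x) = \sum_{S \in \lambda(\theta)} m_x(S)$$

as required.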

The above theorem shows that appropriateness measures (Definition 7)
correspond to compound appropriateness degrees (Definition 5) where the mass
assignment $m_x$ is determined using some mass selection function. It is important
to realize that while appropriateness measures are functional, no non-trivial
appropriateness measures are truth-functional. To see this note that by AM1
appropriateness degrees satisfy the law of excluded middle. However, from a
theorem due to Dubois and Prade [1] it is known that any truth-functional logic
satisfying the law of excluded middle can only be binary. Hence, non-binary
appropriateness measures cannot be truth-functional.

There are a number of clear connections between appropriateness measures
and Shafer-Dempster theory [8]. Suppose we simply view $m_x$ as a conditional
mass assignment on $2^{LA}$ given value $x$. In this case, for any $k \le n$ labels
$L_1, \ldots, L_k$ the appropriateness of the disjunction $L_1 \vee \cdots \vee L_k$ as a description
of $x$ is given by:

$$\mu_{L_1 \vee \cdots \vee L_k}(x) = \sum_{S : \{L_1, \ldots, L_k\} \cap S \ne \emptyset} m_x(S) = Pl(\{L_1, \ldots, L_k\} | x)$$

Similarly the appropriateness of the conjunction $L_1 \wedge \cdots \wedge L_k$ as a description
of $x$ is given by:

$$\mu_{L_1 \wedge \cdots \wedge L_k}(x) = \sum_{S : \{L_1, \ldots, L_k\} \subseteq S} m_x(S) = Q(\{L_1, \ldots, L_k\} | x)$$

Here $Q$ denotes the commonality function for $m_x$ where for subset $S$, $Q(S)$
represents the total mass that can be moved freely to every element of $S$ [8]. This
latter relationship is used in the following proposition to show how
appropriateness measures can be characterized by t-norms.

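Proposition 1. If $\mu$ is an appropriateness measure on $LE \times \Omega$ then there exists
a t-norm $\wedge_t$ such that

$$\forall x \in \Omega \ \forall R \subseteq LA \quad \mu_{\bigwedge_{L \in R} L}(x) = \wedge_t(\mu_L(x) : L \in R)^1$$

$^1$ For notational elegance we take $\wedge_t(\mu_L(x)) = \mu_L(x)$

if and only if there exists a valid mass assignment $m_x$ satisfying $\forall x \in \Omega$,
$\forall \theta \in LE$ $\mu_\theta(x) = \sum_{S \in \lambda(\theta)} m_x(S)$ where

$$\forall S \subseteq LA \quad m_x(S) = \sum_{T : T \supseteq S} (-1)^{|T - S|} \wedge_t(\mu_L(x) : L \in T)$$

Proof

Since $\mu$ is an appropriateness measure then by Theorem 1 it follows that there
exists a valid mass assignment $m_x$ satisfying

$$\mu_{\bigwedge_{L \in R} L}(x) = \sum_{S : S \supseteq R} m_x(S) = Q(R | x)$$

Then $\forall R \subseteq LA$ $Q(R | x) = \wedge_t(\mu_L(x) : L \in R)$ if and only if

$$\forall S \subseteq LA \quad m_x(S) = \sum_{T : T \supseteq S} (-1)^{|T - S|} \wedge_t(\mu_L(x) : L \in T)$$

by Shafer’s inversion formula (see [8]) as required.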
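The inversion used in Proposition 1 can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function names (`powerset`, `tnorm_of`, `mass_from_tnorm`) and the dictionary representation of appropriateness degrees are assumptions made for illustration. It computes $m_x(S)$ as the alternating sum over supersets $T \supseteq S$, folding a binary t-norm over the degrees of the labels in $T$.

```python
from itertools import combinations


def powerset(labels):
    """Yield every subset of `labels` as a frozenset."""
    items = list(labels)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def tnorm_of(values, tnorm):
    """Fold a binary t-norm over `values`; the empty fold is 1 (the t-norm identity)."""
    result = 1.0
    for v in values:
        result = tnorm(result, v)
    return result


def mass_from_tnorm(mu, tnorm):
    """Return {S: m_x(S)} with m_x(S) = sum over T >= S of (-1)^|T - S| * tnorm(mu_L : L in T).

    `mu` maps each label to its appropriateness degree at a fixed x (hypothetical representation).
    """
    subsets = list(powerset(mu))
    commonality = {t: tnorm_of([mu[l] for l in t], tnorm) for t in subsets}
    return {
        s: sum((-1) ** len(t - s) * commonality[t] for t in subsets if t >= s)
        for s in subsets
    }
```

With two labels of degrees 0.6 and 0.5, the product t-norm yields the masses 0.3, 0.3, 0.2, 0.2 of two independent labels, while `min` yields a consonant (nested) assignment in which the set containing only the less appropriate label receives mass 0.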

We refer to appropriateness measures of the above form as being generated by 
a particular t-norm. An axiomatic definition is as follows: 

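Definition 9. Appropriateness Measures Generated by a t-norm

The appropriateness measure on $LE \times \Omega$ generated by t-norm $\wedge_t$ is the unique
function $\mu : LE \times \Omega \to [0,1]$ such that $\forall x \in \Omega$, $\theta \in LE$ $\mu_\theta(x)$ satisfies:

AMT1 $\forall \theta \in LE$ if $\models \theta$ then $\forall x \in \Omega$ $\mu_\theta(x) = 1$
AMT2 $\forall \theta, \varphi \in LE$ if $\theta \equiv \varphi$ then $\forall x \in \Omega$ $\mu_\theta(x) = \mu_\varphi(x)$
AMT3 $\forall \theta, \varphi \in LE$ if $\models \neg(\theta \wedge \varphi)$ then $\forall x \in \Omega$ $\mu_{\theta \vee \varphi}(x) = \mu_\theta(x) + \mu_\varphi(x)$
AMT4 $\forall R \subseteq LA$, $\forall x \in \Omega$ $\mu_{\bigwedge_{L \in R} L}(x) = \wedge_t(\mu_L(x) : L \in R)$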

Theorem 1 and Proposition 1 show that appropriateness measures generated by 
a t-norm are characterized by a particular form of mass selection function. In [6] 
and [7] we have shown that the appropriateness measures generated by minimum 
and product t-norms are based on the consonant and independent mass selection 
functions respectively. 

It is important to note, however, that not all t-norms generate fully
consistent appropriateness measures. Indeed several well known t-norms generate
appropriateness measures that are only consistent assuming certain constraints
on the appropriateness degrees over the labels. Initially, we formally define the
notions of universal and local consistency for appropriateness measures.

Definition 10. Universally Consistent Appropriateness Measures
The appropriateness measure $\mu$ based on t-norm $\wedge_t$ is said to be universally
consistent if and only if for all values of $\mu_{L_i}(x) \in [0,1] : i = 1, \ldots, n$ it holds
that

$$\forall S \subseteq LA \quad m_x(S) = \sum_{T : T \supseteq S} (-1)^{|T - S|} \wedge_t(\mu_L(x) : L \in T) \ge 0$$

Both min and product generate universally consistent appropriateness
measures (see [6] and [7]).

Definition 11. Locally Consistent Appropriateness Measures
The appropriateness measure $\mu$ generated by t-norm $\wedge_t$ is said to be locally
consistent if and only if there exists a non-empty open subset $A \subseteq [0,1]^n$ such that
if $(\mu_{L_1}(x), \ldots, \mu_{L_n}(x)) \in A$ then

$$\forall S \subseteq LA \quad m_x(S) = \sum_{T : T \supseteq S} (-1)^{|T - S|} \wedge_t(\mu_L(x) : L \in T) \ge 0$$

In [7] it is shown that the appropriateness measure generated by the t-norm
$\wedge_t(y_1, y_2) = \max(0, y_1 + y_2 - 1)$ is only locally consistent on the subset

$$\left\{ (\mu_{L_1}(x), \ldots, \mu_{L_n}(x)) : \sum_{i=1}^{n} \mu_{L_i}(x) \le 1 \text{ or } \sum_{i=1}^{n} (1 - \mu_{L_i}(x)) \le 1 \right\}$$

We now investigate what t-norms can generate universally consistent appro- 
priateness measures. To do this we must first consider the relationship between 
conjunctions and disjunctions of fuzzy labels in the current framework. 

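Proposition 2. If $\mu$ is an appropriateness measure generated by t-norm $\wedge_t$ then
for $\{L_1, \ldots, L_m\} \subseteq LA$

$$\forall x \in \Omega \quad \mu_{L_1 \vee \ldots \vee L_m}(x) = \vee_t(\mu_{L_1}(x), \ldots, \mu_{L_m}(x))$$

where

$$\forall y_i \in [0,1] : i = 1, \ldots, m \quad \vee_t(y_1, \ldots, y_m) = \sum_{\emptyset \ne T \subseteq \{y_1, \ldots, y_m\}} (-1)^{|T| - 1} \wedge_t(y_i : y_i \in T)$$

Proof

Follows trivially from the fact that AM1-AM3 imply that $\mu$ is a probability
measure for every $x \in \Omega$ and hence

$$\mu_{L_1 \vee \ldots \vee L_m}(x) = \sum_{\emptyset \ne T \subseteq \{L_1, \ldots, L_m\}} (-1)^{|T| - 1} \mu_{\bigwedge_{L \in T} L}(x)$$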



The special case of the disjunction given in Proposition 2 when $m = 2$ is

$$\forall y_1, y_2 \in [0,1] \quad \vee_t(y_1, y_2) = y_1 + y_2 - \wedge_t(y_1, y_2)$$

This is Frank’s equation [3] and if we further assume that $\vee_t$ is the dual t-conorm
of $\wedge_t$ then $\wedge_t$ is restricted to the family of Frank’s t-norms defined as follows:

Definition 12. Frank’s Family of t-norms
For parameter $s \in [0, \infty)$

$$\wedge_s(y_1, y_2) = \begin{cases} \min(y_1, y_2) & : s = 0 \\ \log_s\left(1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)}{s - 1}\right) & : s > 0,\ s \ne 1 \\ y_1 \times y_2 & : s = 1 \end{cases}$$
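As an illustration of Definition 12 and of Frank’s equation, here is a short Python sketch. The function names `frank_tnorm` and `dual_tconorm` are assumptions introduced for this example; the dual t-conorm is taken in the usual sense $1 - \wedge_s(1 - y_1, 1 - y_2)$.

```python
import math


def frank_tnorm(y1, y2, s):
    """Frank's t-norm with parameter s in [0, inf), following the three cases of Definition 12."""
    if s == 0:
        return min(y1, y2)
    if s == 1:
        return y1 * y2
    # General case: log base s of 1 + (s^y1 - 1)(s^y2 - 1) / (s - 1).
    return math.log(1 + (s ** y1 - 1) * (s ** y2 - 1) / (s - 1), s)


def dual_tconorm(y1, y2, s):
    """Dual t-conorm of the Frank t-norm: 1 - T(1 - y1, 1 - y2)."""
    return 1 - frank_tnorm(1 - y1, 1 - y2, s)
```

For any of these parameters the pair satisfies Frank’s equation, i.e. `frank_tnorm(y1, y2, s) + dual_tconorm(y1, y2, s)` equals `y1 + y2` up to floating-point error.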
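Theorem 2. There is no universally consistent appropriateness measure
generated by a Frank’s t-norm with $s > 2$

Proof

Let $LA = \{L_1, L_2, L_3\}$ and let $\mu$ be the appropriateness measure generated by
$\wedge_s$ for $s > 1$ such that for some $x \in \Omega$ $\mu_{L_1}(x) = \mu_{L_2}(x) = \mu_{L_3}(x) = y$ for some
$y \in [0,1]$. Then by Proposition 1 and Definition 12 it follows that

$$m_x(\emptyset) = 1 - 3y + 3\wedge_s(y, y) - \wedge_s(y, y, y) = 1 - 3y + \log_s\left(\frac{\left(1 + \frac{(s^y - 1)^2}{s - 1}\right)^3}{1 + \frac{(s^y - 1)^3}{(s - 1)^2}}\right)$$

We now show that for $s > 2$ there exists a value of $y$ for which

$$1 - 3y + \log_s\left(\frac{\left(1 + \frac{(s^y - 1)^2}{s - 1}\right)^3}{1 + \frac{(s^y - 1)^3}{(s - 1)^2}}\right) < 0$$

Putting $z = s^y - 1$ and $w = s - 1$ this corresponds to:

$$\frac{(w + z^2)^3}{w(w^2 + z^3)} < \frac{(z + 1)^3}{w + 1}$$

where $w > 0$ and $z \in [0, w]$

$\Leftrightarrow (w + z^2)^3 (w + 1) < w(w^2 + z^3)(z + 1)^3$
$\Leftrightarrow z^6 - 3wz^5 + 3w^2z^4 - (w^3 + w)z^3 + 3w^2z^2 - 3w^3z + w^4 < 0$
$\Leftrightarrow (z^3 - w)(z - w)^3 < 0$

$\Rightarrow$ since $z \in [0, w]$ that $(z^3 - w) > 0$ and $z < w$
$\Rightarrow z > \sqrt[3]{w}$ and $z < w \Rightarrow s^y - 1 > \sqrt[3]{s - 1}$ and $s^y - 1 < s - 1$
$\Rightarrow y > \log_s(1 + \sqrt[3]{s - 1})$ and $y < 1$

$\log_s(1 + \sqrt[3]{s - 1}) < 1 \Rightarrow 1 + \sqrt[3]{s - 1} < s \Rightarrow \sqrt[3]{s - 1} < s - 1$
$\Rightarrow s - 1 > 1 \Rightarrow s > 2$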



Figures 1 and 2 show plots of the values of $m_x(\emptyset)$ where $y$ (as defined in
Theorem 2) varies over $[0, 1]$ and $s = 0.5$ and $s = 40$ respectively.
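The sign behaviour stated in Theorem 2 can be reproduced numerically. The sketch below is an illustration under assumptions, not the code behind the figures: the names `frank`, `mass_empty` and `threshold` are introduced here, and the grid of 101 points is an arbitrary choice. It evaluates $m_x(\emptyset) = 1 - 3y + 3\wedge_s(y,y) - \wedge_s(y,y,y)$ for three labels sharing the same degree $y$.

```python
import math


def frank(y1, y2, s):
    """Frank's t-norm of Definition 12 for s > 0, s != 1."""
    return math.log(1 + (s ** y1 - 1) * (s ** y2 - 1) / (s - 1), s)


def mass_empty(y, s):
    """m_x(empty set) for three labels that all have appropriateness degree y."""
    pair = frank(y, y, s)         # T(y, y)
    triple = frank(pair, y, s)    # T(y, y, y) by associativity
    return 1 - 3 * y + 3 * pair - triple


def threshold(s):
    """Lower bound log_s(1 + cbrt(s - 1)) on y obtained in the proof of Theorem 2 (s > 2)."""
    return math.log(1 + (s - 1) ** (1 / 3), s)


grid = [i / 100 for i in range(101)]
lowest_s_half = min(mass_empty(y, 0.5) for y in grid)   # stays non-negative
lowest_s_40 = min(mass_empty(y, 40) for y in grid)      # dips below zero
```

For $s = 40$ the mass of the empty set changes sign at `threshold(40)`, roughly 0.40, and stays negative up to $y = 1$, whereas for $s = 0.5$ it remains non-negative on the whole grid.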

Theorem 3. If $|LA| \le 3$ then the appropriateness measure generated by $\wedge_s$ is
universally consistent for all $s \in [0, 1]$

Proof

We first prove the result for $|LA| = 3$.

For $x \in \Omega$ let $\mu_{L_1}(x) = y_1$, $\mu_{L_2}(x) = y_2$, $\mu_{L_3}(x) = y_3$. Without
loss of generality we need only consider the following four subsets:
$\{L_1, L_2, L_3\}$, $\{L_1, L_2\}$, $\{L_1\}$, $\emptyset$. The $s = 1$ and $s = 0$ cases are already proved
(see [6], [7]). Therefore, we assume $s \in (0, 1)$.

Now by Definitions 9 and 12 we have:

$$m_x(\{L_1, L_2, L_3\}) = \wedge_s(y_1, y_2, y_3) = \log_s\left(1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)(s^{y_3} - 1)}{(s - 1)^2}\right)$$

$$m_x(\{L_1, L_2\}) = \wedge_s(y_1, y_2) - \wedge_s(y_1, y_2, y_3) = \log_s\left(\frac{1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)}{s - 1}}{1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)(s^{y_3} - 1)}{(s - 1)^2}}\right)$$

$$m_x(\{L_1\}) = y_1 - \wedge_s(y_1, y_2) - \wedge_s(y_1, y_3) + \wedge_s(y_1, y_2, y_3) = \log_s\left(\frac{s^{y_1}\left(1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)(s^{y_3} - 1)}{(s - 1)^2}\right)}{\left(1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)}{s - 1}\right)\left(1 + \frac{(s^{y_1} - 1)(s^{y_3} - 1)}{s - 1}\right)}\right)$$

$$m_x(\emptyset) = 1 - y_1 - y_2 - y_3 + \wedge_s(y_1, y_2) + \wedge_s(y_1, y_3) + \wedge_s(y_2, y_3) - \wedge_s(y_1, y_2, y_3)$$

$$= \log_s\left(\frac{s\left(1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)}{s - 1}\right)\left(1 + \frac{(s^{y_1} - 1)(s^{y_3} - 1)}{s - 1}\right)\left(1 + \frac{(s^{y_2} - 1)(s^{y_3} - 1)}{s - 1}\right)}{s^{y_1} s^{y_2} s^{y_3}\left(1 + \frac{(s^{y_1} - 1)(s^{y_2} - 1)(s^{y_3} - 1)}{(s - 1)^2}\right)}\right)$$

Now let $z_1 = s^{y_1} - 1$, $z_2 = s^{y_2} - 1$ and $z_3 = s^{y_3} - 1$ and note that
$z_1, z_2, z_3 \in [s - 1, 0]$.

Now $m_x(\{L_1, L_2, L_3\}) \ge 0$ trivially since $\wedge_s$ is a t-norm.

Since $s \in (0, 1)$, $m_x(\{L_1, L_2\}) \ge 0$

$$\Leftrightarrow \frac{(s - 1)(s - 1 + z_1 z_2)}{s^2 - 2s + 1 + z_1 z_2 z_3} \le 1$$

$\Leftrightarrow$ (since the denominator is $> 0$) $(s - 1)(s - 1 + z_1 z_2) - s^2 + 2s - 1 - z_1 z_2 z_3 \le 0$
$\Leftrightarrow z_1 z_2 (s - 1 - z_3) \le 0$. This holds trivially since $z_1 \le 0$, $z_2 \le 0$ and $(s - 1 - z_3) \le 0$.

Since $s \in (0, 1)$, $m_x(\{L_1\}) \ge 0$

$$\Leftrightarrow \frac{(s^2 - 2s + 1 + z_1 z_2 z_3)(z_1 + 1)}{(s - 1 + z_1 z_3)(s - 1 + z_1 z_2)} \le 1$$

$\Leftrightarrow (s^2 - 2s + 1 + z_1 z_2 z_3)(z_1 + 1) - (s - 1 + z_1 z_3)(s - 1 + z_1 z_2) \le 0$
(since the denominator is $> 0$) $\Leftrightarrow z_1 (1 - s + z_3)(1 - s + z_2) \le 0$.

This holds trivially since $z_1 \le 0$, $(1 - s + z_2) \ge 0$ and $(1 - s + z_3) \ge 0$.

Since $s \in (0, 1)$, $m_x(\emptyset) \ge 0$

$$\Leftrightarrow \frac{s (s - 1 + z_1 z_2)(s - 1 + z_1 z_3)(s - 1 + z_2 z_3)}{(s - 1)(z_1 + 1)(z_2 + 1)(z_3 + 1)(s^2 - 2s + 1 + z_1 z_2 z_3)} \le 1$$

$\Leftrightarrow$ (since the denominator is $< 0$)
$s (s - 1 + z_1 z_2)(s - 1 + z_1 z_3)(s - 1 + z_2 z_3) - (s - 1)(z_1 + 1)(z_2 + 1)(z_3 + 1)(s^2 - 2s + 1 + z_1 z_2 z_3) \ge 0$
$\Leftrightarrow (1 - s + z_3)(1 - s + z_2)(1 - s + z_1)(1 - s + z_1 z_2 z_3) \ge 0$

This holds trivially since $(1 - s + z_1) \ge 0$, $(1 - s + z_2) \ge 0$, $(1 - s + z_3) \ge 0$ and
$(1 - s + z_1 z_2 z_3) \ge 0$.

Since the $|LA| = 3$ case is universally consistent, it must follow that the
$|LA| < 3$ case is also universally consistent; otherwise we could extend any
counterexample where $|LA| < 3$ to the $|LA| = 3$ case by setting the remaining
appropriateness degrees to zero.

Empirical studies suggest that this result can be extended to cases where
$|LA| > 3$; however, no general result has yet been proven.

Fig. 1. Plot of values of $m_x(\emptyset)$ where $s = 0.5$, $\mu_{L_1}(x) = \mu_{L_2}(x) = \mu_{L_3}(x) = y$
and $y$ varies between 0 and 1

Fig. 2. Plot of values of $m_x(\emptyset)$ where $s = 40$, $\mu_{L_1}(x) = \mu_{L_2}(x) = \mu_{L_3}(x) = y$
and $y$ varies between 0 and 1



5 Conclusions 

In this paper we have outlined a random set model of fuzzy labels defined on
subsets of labels and for which functionality is maintained through the use of
mass selection functions. This model was shown to be characterised by a simple
axiomatization of appropriateness measures allowing for the functional
representation of fuzzy concepts while preserving the law of the excluded middle
and standard logical equivalence properties. It was also shown how t-norms can be
used to generate appropriateness measures and we argued that in this context
attention should be restricted to the family of Frank’s t-norms. It was shown
that no Frank t-norm with $s > 2$ can generate a universally consistent
appropriateness measure and it was shown that for $|LA| \le 3$ all Frank’s t-norms
with $s \in [0, 1]$ generate universally consistent appropriateness measures.
Empirical tests suggest this latter result can be extended but further research
is required.
Abstract. The paper describes a method for inducing first-order
rules with fuzzy predicates from a database. First, the paper makes a
distinction between fuzzy rules allowing for some tolerance with respect
to the interpretative scope of the predicates, and fuzzy rules aiming at
expressing a set of ordinary rules in a global way. Moreover the paper
only considers the induction of Horn-like implicative-based fuzzy rules.
Specific confidence degrees are associated with each kind of fuzzy rules
in the inductive process. This technique is illustrated on an experimental
application. The interest of learning various types of fuzzy first-order
logic expressions is emphasized.

Keywords: inductive logic programming, fuzzy rule, confidence degree



1 Introduction 

Inductive Logic Programming (ILP) [14] provides a general framework for
learning classical first-order logic rules, for which reasonably efficient
algorithms have been developed (Progol [11], FOIL [17]...). But first-order logic
cannot directly handle rules with exceptions, which are common in practice. This
has been a motivation for introducing probabilities in ILP ([12]). In fact,
probabilities already appear implicitly in the control procedure of FOIL. Indeed,
during the gain computation, the value associated to a rule can be viewed as a
confidence degree expressed in terms of “domain probabilities”. Such
probabilities, together with “world probabilities”, are the basic notions of
Halpern’s first-order probabilistic logic [8]. Domain probabilities are used to
capture statistical information for a fixed first-order logic interpretation.
These probabilities are obtained by applying a probability measure to the set of
valuations which make the rules true in the interpretation. So, there is no
longer any genuine quantifier in a rule when there is a non-zero probability of
encountering exceptions.

World probabilities are used in order to evaluate the set of interpretations
where a rule is universally true. Bacchus [2] uses them for capturing degrees of
belief, given a knowledge base KB, by computing the proportion of interpretations
which are models of KB and of the target assertion, among the interpretations
of KB. But the effective computation is generally difficult. In a recent paper
[15], it has been pointed out that these world probabilities could also be used
for handling fuzzy predicates, modeled in terms of likelihood functions (this
work only considers one kind of fuzzy rules). Halpern’s framework has also been
adapted for dealing with relational databases [7]. There, world probabilities are
interpreted as probabilistic dependencies of interpretations and are represented
by Bayesian networks. Besides, Muggleton [12] presents another approach to
induction with probabilities. His goal is to learn Stochastic Logic Programs
(SLP). In this latter framework, the probability associated to a formula is not a
confidence degree; rather, it represents the potential of the rule for explaining
examples. Koller [10] describes a method for computing the confidence degree of a
rule from examples by using Bayesian networks. Confidence values are then
computed independently from the learning process on the basis of other examples.

In the propositional framework, confidence degrees have been integrated in
the induction process, together with the handling of fuzzy properties. At least
three main lines of work can be distinguished w.r.t. this latter concern. First,
neuro-fuzzy learning techniques have been developed for tuning fuzzy membership
functions in fuzzy rules; see [13] for a survey. The fuzzy rules produced in that
way are used for function approximation in automatic control problems. Another
research line has been investigated with a greater concern for the descriptive
power of the fuzzy rules from the user’s point of view, by extending Quinlan’s
[16] ID3 algorithm to fuzzy decision trees, involving fuzzy descriptions of
classes and making use of entropy measures (extended to fuzzy sets) for building
the fuzzy rules; see [4] for a survey. More recently, the use of fuzzy membership
functions has been advocated by several researchers for providing association
rules in data mining with a better representation power, e.g. [9]. In these
different problems, fuzzy rules are derived which generally involve unary fuzzy
properties. Moreover, fuzzy association rules are completed with (usually scalar)
confidence and support degrees.

However, fuzzy degrees [3] have been proposed for taking into account the
possible variation of these degrees with the level cuts of the fuzzy sets. In the
above-mentioned works, fuzzy sets are introduced either for equipping rules with
an interpolative mechanism or for making them more robust. Indeed fuzzy sets can
serve various purposes, and moreover there exist different types of fuzzy rules
[6], modeling uncertainty or graduality in property satisfaction. Since
propositional-like rules have a limited expressive power, the aim of this paper
is to adapt the ILP approach (restricted here to non-recursive function-free Horn
clauses) and the computation of confidence degrees in order to allow for the
learning of various types of first-order rules which may involve fuzzy
predicates. For this purpose, we use domain probabilities in order to describe
the confidence degrees associated with each kind of rule. Moreover algorithms
such as FOIL embed confidence degrees for controlling the learning process. This
will enable us to handle the fuzziness of the predicates directly in the
computation.

The paper is organized as follows. Section 2 provides a brief background
on ILP and probabilistic logic. Section 3 presents different types of fuzzy rules
and an approach for computing the confidence degree associated to each type.
Sections 4 and 5 describe our algorithm and illustrate the approach on an
example.



2 Background 

2.1 ILP 

We first briefly recall the standard definitions and notations. Given a first-order
language $\mathcal{L}$ with a set of variables $Var$, we build the set of terms $Term$, atoms
$Atom$ and formulas as usual. The set of ground terms is the Herbrand universe
$\mathcal{H}$ and the set of ground atoms or facts is the Herbrand base $B \subseteq Atom$. A
literal $l$ is just an atom $a$ (positive literal) or its negation $\neg a$ (negative literal). A
(resp. ground) substitution $\sigma$ is an application from $Var$ to $Term$ (resp. $\mathcal{H}$) with
inductive extension to $Atom$. We denote by $Subst$ the set of ground substitutions. A
clause is a finite disjunction of literals $l_1 \vee \ldots \vee l_n$, also denoted $\{l_1, \ldots, l_n\}$. A Horn
clause is a clause with at most one positive literal. A Herbrand interpretation
$I$ is just a subset of $B$: $I$ is the set of true ground atomic formulas and its
complement denotes the set of false ground atomic formulas. Let us denote by
$\mathcal{I} = 2^B$ the power set of $B$, i.e. the set of all Herbrand interpretations. We can
now proceed with the notion of logical consequence.

Definition 1. Given $A$ an atomic formula, $I, \sigma \models A$ means that $\sigma(A) \in I$. As
usual, the extension to general formulas $F$ uses compositionality.

$I \models F$ means $\forall \sigma$, $I, \sigma \models F$ (we say $I$ is a model of $F$).

$\models F$ means $\forall I \in \mathcal{I}$, $I \models F$.

$F \models G$ means that all models of $F$ are models of $G$.

Stated in the general context of first-order logic, the task of induction is to find
a set of formulas $H$ such that:

$$B \cup H \models E \qquad (1)$$

given a background theory $B$ and a set of observations $E$ (training set), where
$E$, $B$ and $H$ here denote sets of clauses. A set of formulas is here, as usual,
considered as the conjunction of its elements. Of course, one may add two natural
restrictions:

– $B \not\models E$ since otherwise $H$ would not be necessary to explain $E$.

– $B \cup H \not\models \bot$: this means $B \cup H$ is a consistent theory.

In the setting of relational databases, inductive logic programming is often
restricted to Horn clauses and function-free formulas, and $E$ is just a set of
ground facts. Moreover, the set $E$ itself satisfies the previous requirement, but
it is generally not considered as an acceptable solution since it has no
predictive ability. Usually, rule extraction fits with the idea of providing a
compression of the information content of $E$.
There are two general types of algorithms: top-down and bottom-up algorithms.
Top-down ones start from the most general clause and specialize it step
by step. Bottom-up procedures start from a fact and generalize it. In our case, we
will use the FOIL algorithm [17], which is of the top-down type. The goal of FOIL
is to produce rules until all the examples are covered. Rules with conclusion part
$C$, the target predicate, are found in the following way:

1. take $A \to C$ as the most general clause with $A = \top$

2. choose the literal $l$ such that the clause $l \wedge A \to C$ maximizes the gain function

3. $A = l \wedge A$

4. if confidence($A \to C$) < threshold goto 2

5. return $A \to C$

The gain function is computed by the formula:

$$gain(l \wedge A \to C, A \to C) = n * (\log_2(cf(l \wedge A \to C)) - \log_2(cf(A \to C)))$$

where $n$ is the number of distinct examples covered by $l \wedge A \to C$. Given a Horn
clause $A \to C$, the confidence is $cf(A \to C) = \frac{P(A \wedge C)}{P(A)}$. The way of computing
probabilities of first-order logic rules is presented in the next subsection.
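The gain formula and the greedy choice in step 2 can be sketched as follows. This is a minimal illustration, not FOIL itself: the function names, the candidate literals `p(X)` and `q(X,Y)`, and their coverage and confidence figures are all hypothetical, and confidences are assumed to be strictly positive so that the logarithms are defined.

```python
import math


def foil_gain(n, cf_refined, cf_current):
    """gain = n * (log2(cf(l ^ A -> C)) - log2(cf(A -> C))).

    n is the number of distinct examples covered by the refined clause;
    both confidences must be > 0.
    """
    return n * (math.log2(cf_refined) - math.log2(cf_current))


def best_literal(candidates, cf_current):
    """Return the literal whose addition maximizes the gain.

    `candidates` maps a literal name to (n, cf) for the clause refined with it.
    """
    return max(candidates, key=lambda lit: foil_gain(*candidates[lit], cf_current))


# Hypothetical candidates: literal -> (examples covered, confidence after adding it).
candidates = {"p(X)": (3, 1.0), "q(X,Y)": (5, 0.6)}
choice = best_literal(candidates, cf_current=0.5)
```

Here `p(X)` is selected: it covers fewer examples than `q(X,Y)` but doubles the confidence, giving a gain of 3 against roughly 1.3.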

2.2 Probabilistic Logic 

We focus here on the logic of probability as developed by Halpern in [8]. The
aim of Halpern’s work, inspired by previous works of Bacchus [1], is to design a
first-order logic for capturing reasoning about beliefs and statistical information.
Here we restrict Halpern’s framework to the usual logic programming setting.
This means that the domain of objects is the Herbrand universe $\mathcal{H}$ and an
interpretation $I$ is just a subset of $B$. In fact, we just apply Halpern’s
definitions for attaching probabilities to Horn clauses, without using the
associated notion of logical consequence.

Let us give a meaning to the probability of a non-closed first-order formula in
a given interpretation. Halpern calls a type 1 structure the triple $M = (I, \mathcal{H}, P)$
where $I$ is a Herbrand interpretation, $\mathcal{H}$ is the Herbrand universe, and $P$ is
a probability measure over $\mathcal{H}$ (of course, the probability $P^n$ is available over
the product domain $\mathcal{H}^n$). Given a type 1 structure $M$ and a non-closed formula
$F$ with the vector $\vec{x}$ of $n$ free variables, the meaning of $P^I(F)$ (abbreviated as
$P(F)$ when $I$ is clear from the context) is the probability that a random vector
$\vec{t}$ on $\mathcal{H}^n$ makes $\sigma[\vec{x}/\vec{t}](F)$ true in $I$, hence the formal definition:

Definition 2. Given $M = (I, \mathcal{H}, P)$ and $F$ a formula with a vector $\vec{x}$ of $n$ free
variables, the probability of $F$ is given by:

$$P(F) = P^n(\{\vec{t} \in \mathcal{H}^n \mid I \models \sigma[\vec{x}/\vec{t}](F)\}) \qquad (2)$$

If $P$ is a uniform probability over $\mathcal{H}$, it is easy to see that $P(F)$ is a frequency.

Type 1 structures are useful for capturing statistical information but these
structures are insufficient for describing probabilities on closed formulas. Indeed,
a closed formula has no free variable. So the type 1 probability of a closed formula
is 0 or 1 according to whether this formula is true or false in $I$.
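To make the frequency reading of Definition 2 concrete, the following Python sketch counts satisfying tuples over a small Herbrand universe. The universe, the facts and the function name `probability` are hypothetical choices for illustration and do not come from the paper; the final ratio is the conditional frequency of the head given the body, which is how a confidence degree is obtained from two such probabilities.

```python
from itertools import product


def probability(universe, arity, holds):
    """Type 1 probability of a formula with `arity` free variables under a uniform P:
    the fraction of tuples in universe^arity for which `holds` returns True."""
    tuples = list(product(universe, repeat=arity))
    return sum(1 for t in tuples if holds(*t)) / len(tuples)


# Hypothetical Herbrand universe and interpretation (closed world: anything not listed is false).
UNIVERSE = ["a", "b", "c"]
A_FACTS = {("a", "b"), ("a", "c"), ("b", "a")}   # true ground atoms A(X, Y)
C_FACTS = {"a"}                                  # true ground atoms C(X)

p_body = probability(UNIVERSE, 2, lambda x, y: (x, y) in A_FACTS)
p_both = probability(UNIVERSE, 2, lambda x, y: (x, y) in A_FACTS and x in C_FACTS)
confidence = p_both / p_body                     # P(A and C) / P(A)
```

Three of the nine pairs satisfy `A(X, Y)` and two of them also satisfy `C(X)`, so the body has probability 1/3 and the rule `A(X, Y) -> C(X)` has confidence 2/3 in this toy interpretation.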
3 Various Types of Fuzzy Rules and Confidence Degrees

We consider a first-order logic database $K$ with fuzzy predicates (e.g.,
comfortable, close-to) as a set of positive facts labeled by real numbers in [0,1].
For instance in Sect. 4 we shall deal with a database containing facts such as
(comfortable(a), 0.5) and (close_to(a, sea), 0.4). Thus, $K$ is made of pairs of the
form $(A(\vec{t}), \mu(A(\vec{t})))$ for $\vec{t} \in \mathcal{H}^n$, where $A(\vec{t})$ is a fact, and $\mu(A(\vec{t}))$ is the
satisfaction degree associated with the fuzzy property $A$ for $\vec{t}$.

3.1 Different Types of Fuzzy Rules 

There exist at least two reasons for introducing fuzzy predicates in universally
quantified rules. This may be for making them either more flexible, or more
expressive. Indeed, a fuzzy predicate can be viewed as a family of ordinary
predicates whose characteristic functions are the level cut functions $\mu_{F_\alpha}$
associated to the fuzzy set membership function $\mu_F$, namely $\mu_{F_\alpha}(\vec{t}) = 1$ iff
$\mu(F(\vec{t})) \ge \alpha$ and $\mu_{F_\alpha}(\vec{t}) = 0$ otherwise. Thus a rule “$A(\vec{t}) \to C(\vec{t})$” is naturally
associated with the crisp rules “$A_\alpha(\vec{t}) \to C_\beta(\vec{t})$”. Note that, if $A_\beta(\vec{t})$ holds then
$A_\alpha(\vec{t})$ also holds for $\alpha \le \beta$. So we may only consider the crisp approximations
“$A_\alpha(\vec{t}) \to C_\alpha(\vec{t})$”.

Then, if we are concerned with flexibility, a possible understanding of the fuzzy
rule “$A(\vec{t}) \to C(\vec{t})$” can be

$$\forall \vec{t}, \exists \alpha \quad A_\alpha(\vec{t}) \to C_\alpha(\vec{t}), \qquad (3)$$

i.e. there exists a crisp understanding of the fuzzy rule which covers each example
(but it is not necessarily the same for each example since $\alpha$ depends on $\vec{t}$). This
is a kind of rule already considered in [15]. By flexible rules, we mean here rules
which are robust, since their predicates can be adapted to borderline situations.

If we are concerned with expressivity, we may look for fuzzy rules such that the
rule holds for each of its level cut counterparts. This means that we have

$$\forall \vec{t}, \forall \alpha \quad A_\alpha(\vec{t}) \to C_\alpha(\vec{t}) \qquad (4)$$

This is clearly more restrictive than (3) since the fuzzy rule is equivalent to a
set of ordinary rules with nested predicates and summarizes it into a unique
fuzzy rule. In fact (4) is nothing but a gradual rule [6] expressing “The more $\vec{t}$
satisfies $A$, the more $\vec{t}$ satisfies $C$” (since they are modeled by a constraint of
the form $\mu(A(\vec{t})) \le \mu(C(\vec{t}))$).

Gradual rules are one of the four basic kinds of fuzzy rules [6]. Two of them,
namely gradual rules and certainty rules, are based on implication connectives
and express constraints on the possible models of the world. The two other types,
named possibility rules and antigradual rules, rather express that some values
are guaranteed to be possible (i.e. that they exist in the base of examples). For
instance, let us take possibility rules of the form “The more it is $A$, the more
all the interpretations which make $C$ true (truth becomes a matter of degree
when $C$ is fuzzy) are guaranteed to be possible”. This means in practice that
“The more $\vec{t}$ is $A$, the more there are examples for any possible interpretation
of $C$”. Note that this rule cannot have any counter-example.

In the following, we only consider gradual and certainty rules. Certainty rules
contrast with possibility rules, and express that “the more it is $A$, the more
certain is $C$”. Let us first consider the case where “$A$” is a fuzzy predicate
and “$C$” is an ordinary predicate. This expresses that “the more it is $A$, i.e. the
greater $\alpha$ such that $A(\vec{t}) \ge \alpha$, the smaller the number of exceptions of the rule
$A_\alpha(\vec{t}) \to C(\vec{t})$”. Indeed when $\alpha$ decreases the number of exceptions cannot
but increase since the scope of $A_\alpha$ is then enlarged. When $C$ is also a fuzzy
predicate, in order to preserve this understanding of the rule, we are led to look
for rules of the form

$$\forall \vec{t}, \forall \alpha \quad A_\alpha(\vec{t}) \to C_{1-\alpha}(\vec{t}) \qquad (5)$$

since when $\alpha$ increases $C_{1-\alpha}$ covers more cases.



3.2 Confidence Degree in the Crisp Case 

For computing confidence degrees, we must define probabilities on first-order
logic formulas from ILP data. ILP data are supposed to describe one
interpretation under the Closed World Assumption, i.e. we use domain
probabilities. We call $I_{ILP}$ this interpretation. So, given a fact $f$:

$$I_{ILP} \models f \text{ iff } B \wedge E \models f.$$

The domain $\mathcal{H}$ is the Herbrand domain described by $B$ and $E$. We take $P$ as a
uniform probability on $\mathcal{H}$. So it is easy to deduce that the confidence in a clause
$A \to C$, with $\vec{x}$ as vector of the $n$ free variables, is:

$$cf(A(\vec{x}) \to C(\vec{x})) = \frac{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models \sigma[\vec{x}/\vec{t}](A(\vec{x}) \wedge C(\vec{x}))\}|}{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models \sigma[\vec{x}/\vec{t}](A(\vec{x}))\}|} \qquad (6)$$

where $|\ |$ denotes cardinality. Another possible definition of a confidence degree
might be taken here as the proportion of the number of positive examples covered
by the rule w.r.t. the number of total examples (positive and negative) covered
by the rule. This confidence degree would represent the probability that a fact
deduced from the rule is true. But this definition would not take into account
the number of situations covered in the condition part of the rule, which is not
always the total number of examples covered since we are in a first-order setting.

In ILP, the goal is to learn a concept represented by a predicate. E is the 
set of all facts pertaining to the target predicate. B is the set of facts pertaining 
to predicates other than the target one. So the learned rules are (in the non- 
recursive case) composed by predicates that appear in B for the condition part 
and by the target predicate in the consequence part. 



3.3 Confidence Degree in the Fuzzy Case 

Flexible Rules. This first type of meaning for a fuzzy rule “$A(\vec{t}) \to C(\vec{t})$”
is close to the one of a classical rule. Of course, we are now expecting that
the satisfaction degrees of $A(\vec{t})$ and $C(\vec{t})$ are as high as possible. So we can
introduce classical interpretations associated with each $\alpha$-cut.

Definition 3. An $\alpha$-interpretation $I_\alpha$, given a fact $f$, is defined by:

$$I_\alpha \models f \text{ iff } B \wedge E \models f \text{ and } \mu(f) \ge \alpha$$

In this type of interpretation, only facts that have a satisfaction degree of at
least $\alpha$ are true. Of course we have $I_\alpha \subseteq I_{ILP}$. Now we have to compute the
confidence degree of the rule in the classical way (using (6)) for each
$\alpha$-interpretation. According to the intended meaning of the fuzzy rule, we must
favour the confidence degrees of the rule computed in high $\alpha$-interpretations.
Indeed, we prefer that the examples be covered at a high degree of satisfaction.
The following definition, which is an adaptation in terms of first-order logic of
the one proposed by [5], takes this into account:

$$cf_{flex}(A(\vec{t}) \to C(\vec{t})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * cf(A(\vec{t}) \to C(\vec{t}))_{I_{\alpha_i}}$$

where $\alpha_1 = 1, \ldots, \alpha_t = 0$ is the decreasingly ordered list of the satisfaction degrees
that appear in the database. This confidence degree corresponds to the
discretization of a Choquet integral of the confidence degrees on
$\alpha$-interpretations.

Gradual Rules. In this case, the values of the satisfaction degrees are only
useful for comparing satisfaction degrees in condition and conclusion parts. So,
we do not privilege the confidence degree in high $\alpha$-interpretations as previously.

$$cf_{grad}(A(\vec{t}) \to C(\vec{t})) = \frac{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models \sigma[\vec{x}/\vec{t}](A \wedge C),\ \mu(\sigma[\vec{x}/\vec{t}]C) \ge \mu(\sigma[\vec{x}/\vec{t}]A)\}|}{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models \sigma[\vec{x}/\vec{t}](A)\}|}$$

When the valuation of the condition part of the rule is a conjunction of
ground literals, the satisfaction degree of this conjunction is the minimum of
the degree of each literal.

Certainty Rules with Fuzzy Conditions. The meaning of the fuzzy rule
“$A(\vec{t}) \to C(\vec{t})$” is then “the more $\vec{t}$ is $A$, the more certain is $C$”. For these
rules we are not interested in the satisfaction degrees of the consequence parts.
This type of rule will be referred to as type 1 certainty rules in the following. The
$\alpha$-cuts for these rules correspond to the following type of classical interpretation:

Definition 4. An $\alpha$-certainty interpretation, given a fact $f$, is defined by:

$$I_{\alpha\text{-}cert} \models f \text{ iff } (B \models f \text{ and } \mu(f) \ge \alpha) \text{ or } E \models f$$

Note that we cannot construct this type of interpretation in the recursive case,
since $B \cap E \ne \emptyset$. With this kind of rules, confidence degrees are expected to
be high for high $\alpha$-certainty interpretations. The idea is that we can be more
permissive with respect to exceptions for the classical counterparts of the rule
“$A(\vec{t}) \to C(\vec{t})$” corresponding to small values of $\alpha$. So, we are led to use the
following Choquet integral.

$$cf_{cert1}(A(\vec{t}) \to C(\vec{t})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * cf(A(\vec{t}) \to C(\vec{t}))_{I_{\alpha_i\text{-}cert}}$$

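Certainty Rules with Fuzzy Conditions and Conclusions. The above
definition is modified in the following way for taking care of the satisfaction
degree of the consequence of the rules. This type of rule will be referred to as
type 2 certainty rules in the following.

$$cf_{cert2}(A(\vec{t}) \to C(\vec{t})) = \frac{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models \sigma[\vec{x}/\vec{t}](A \wedge C),\ \mu(\sigma[\vec{x}/\vec{t}]C) > 1 - \mu(\sigma[\vec{x}/\vec{t}]A)\}|}{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models \sigma[\vec{x}/\vec{t}](A)\}|}$$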

Examples of Confidence Degree Computation: consider the following
database with satisfaction degrees associated with fuzzy predicates $A$ and $C$:

$B_1 = \{A(a, b), 1;\ A(a, c), 0.5;\ A(b, a), 0.7;\ A(c, b), 0.2\}$, $E_1 = \{C(a), 0.8;\ C(b), 0.4\}$.
Any other fact is assumed to be false.

$cf_{flex}(A(X,Y) \to C(X)) = 0.61$; $cf_{grad}(A(X,Y) \to C(X)) = 0.25$
$cf_{cert1}(A(X,Y) \to C(X)) = 0.95$; $cf_{cert2}(A(X,Y) \to C(X)) = 0.75$

$B_2 = \{A(a, b), 0.8;\ A(a, c), 0.1;\ A(b, a), 0.4;\ A(c, b), 0.3\}$,
$E_2 = \{C(a), 0.9;\ C(b), 0.6\}$.

$cf_{flex}(A(X,Y) \to C(X)) = 0.70$; $cf_{grad}(A(X,Y) \to C(X)) = 0.75$
$cf_{cert1}(A(X,Y) \to C(X)) = 0.70$; $cf_{cert2}(A(X,Y) \to C(X)) = 0.25$
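The four confidence degrees can be recomputed for the rule $A(X,Y) \to C(X)$ with a few lines of Python. This is a sketch under stated assumptions rather than the authors' program: the function names and the dictionary encoding of the database are introduced here, a confidence with an empty condition part is taken to be 0, the level cuts use "at least $\alpha$", and the type 2 certainty condition uses a strict inequality because that choice reproduces the values reported for both databases.

```python
# Degrees of A(X, Y) and C(X) in the first example database; absent facts are false.
B1 = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "a"): 0.7, ("c", "b"): 0.2}
E1 = {"a": 0.8, "b": 0.4}


def ratio(num, den):
    """Frequency ratio; an empty condition part gives confidence 0 (assumption)."""
    return num / den if den else 0.0


def levels(a_deg, c_deg):
    """Decreasing list alpha_1 = 1 > ... > alpha_t = 0 of the degrees in the database."""
    return sorted({1.0, 0.0, *a_deg.values(), *c_deg.values()}, reverse=True)


def cf_cut(a_deg, c_deg, alpha, crisp_conclusion=False):
    """Crisp confidence in the alpha-interpretation.

    With crisp_conclusion=True the facts of C hold whatever their degree
    (alpha-certainty interpretation)."""
    body = [pair for pair, d in a_deg.items() if d >= alpha]
    covered = sum(
        1 for x, _ in body
        if x in c_deg and (crisp_conclusion or c_deg[x] >= alpha)
    )
    return ratio(covered, len(body))


def choquet(a_deg, c_deg, crisp_conclusion=False):
    """Sum of (alpha_i - alpha_{i+1}) * cf in the interpretation at level alpha_i."""
    alphas = levels(a_deg, c_deg)
    return sum(
        (hi - lo) * cf_cut(a_deg, c_deg, hi, crisp_conclusion)
        for hi, lo in zip(alphas, alphas[1:])
    )


def cf_flex(a_deg, c_deg):
    return choquet(a_deg, c_deg)


def cf_cert1(a_deg, c_deg):
    return choquet(a_deg, c_deg, crisp_conclusion=True)


def cf_grad(a_deg, c_deg):
    ok = sum(1 for (x, _), d in a_deg.items() if x in c_deg and c_deg[x] >= d)
    return ratio(ok, len(a_deg))


def cf_cert2(a_deg, c_deg):
    # Strict inequality: an assumption that matches the reported figures.
    ok = sum(1 for (x, _), d in a_deg.items() if x in c_deg and c_deg[x] > 1 - d)
    return ratio(ok, len(a_deg))
```

On $B_1$, $E_1$ this gives about 0.617 for the flexible rule (reported as 0.61), 0.25 for the gradual rule, 0.95 for the type 1 certainty rule and 0.75 for the type 2 certainty rule.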



4 Algorithmic Issues 

In the FOIL algorithm, the inductive process in a given state is guided by three
things: the confidence degree, the stopping condition and the number of distinct
examples that are covered by the rule. An example is an element of $E$; a
counter-example is a fact pertaining to the target predicate which is false in the
ILP interpretation. Confidence degrees for different kinds of fuzzy rules have
been defined in the previous section. We now have to describe the halting
condition and the counting of distinct examples covered by the rules.

In the crisp case, the FOIL algorithm stops if there are no counter-examples
covered by the rule. In this case, the confidence degree is 1, i.e. the rule is totally
certain. But, in the fuzzy case, a rule that covers only positive examples may have
a low confidence degree. For example, given the fuzzy data $B = \{A(a, b), 0.2\}$,
$E = \{C(a), 1\}$, the rule “$A(X,Y) \to C(X)$” covers the unique example, but we
have $cf_{flex}(A(X,Y) \to C(X)) = 0.2$. This suggests the use of a threshold.

The number n of positive examples covered by the rule is needed for
comparing rules which have similar confidence degrees. The counterpart of n in
the fuzzy case depends on the type of the fuzzy rule. In the case of
gradual rules or of type 2 certainty rules, n is the number of examples covered
by the crisp version of the rule and whose conditions and conclusion obey the
requirement about satisfaction degrees. Let $\vec{t_1}$ denote the q free
variables that appear in the head of the rule and $\vec{t_2}$ the other r
variables that appear in the rule. Then we have:

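\[ n_{grad}(A(\vec{t}) \to C(\vec{t})) = |\{\vec{x_1} \in \mathcal{H}^q, \exists \vec{x_2} \in \mathcal{H}^r \mid I_{ILP} \models \sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}](A \wedge C),\ \mu(\sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}]C) > \mu(\sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}]A)\}| \]

and

\[ n_{cert2}(A(\vec{t}) \to C(\vec{t})) = |\{\vec{x_1} \in \mathcal{H}^q, \exists \vec{x_2} \in \mathcal{H}^r \mid I_{ILP} \models \sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}](A \wedge C),\ \mu(\sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}]C) > 1 - \mu(\sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}]A)\}| \]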

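In the case of rules that handle flexibility, we are more interested in the
examples covered by high α-interpretations. So, we compute a Choquet integral
of the number of distinct examples covered in each α-interpretation:

\[ n_{flex}(A(\vec{t}) \to C(\vec{t})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * |\{\vec{x_1} \in \mathcal{H}^q, \exists \vec{x_2} \in \mathcal{H}^r \mid I_{\alpha_i} \models \sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}](A \wedge C)\}| \]

For the type 1 certainty rules, we are especially interested in the examples
covered in the high α-cert interpretations, for which the confidence degree is
high. Then we have:

\[ n_{cert1}(A(\vec{t}) \to C(\vec{t})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * |\{\vec{x_1} \in \mathcal{H}^q, \exists \vec{x_2} \in \mathcal{H}^r \mid I_{\alpha_i cert} \models \sigma[\vec{t_1}/\vec{x_1}, \vec{t_2}/\vec{x_2}](A \wedge C)\}| \]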

Example of Computation of n, with B_1, E_1 as in Sect. 3.3:

n_flex(A(X,Y) → C(X)) = 1.1; n_grad(A(X,Y) → C(X)) = 1
n_cert1(A(X,Y) → C(X)) = 1.5; n_cert2(A(X,Y) → C(X)) = 2
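
As a minimal sketch, the two counts that rely on the crisp version of the rule can be reproduced for the single rule A(X,Y) → C(X) on B_1, E_1. The dictionary encoding, the function names and the reading of the comparisons as strict are assumptions of this sketch, not part of the original algorithm; B_1 and E_1 contain no ties, so a non-strict comparison would give the same counts.

```python
# Fuzzy facts of B_1 and E_1 (Sect. 3.3); any fact not listed is false.
A = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "a"): 0.7, ("c", "b"): 0.2}
C = {"a": 0.8, "b": 0.4}


def count_covered(a_facts, c_facts, keep):
    """Count distinct bindings of the head variable X in A(X, Y) -> C(X).

    X is counted when some Y makes A(X, Y) and C(X) hold in the crisp
    reading (degree > 0) and keep(mu_a, mu_c) accepts the two degrees.
    """
    covered = set()
    for (x, _y), mu_a in a_facts.items():
        mu_c = c_facts.get(x, 0.0)  # missing facts are false
        if mu_a > 0 and mu_c > 0 and keep(mu_a, mu_c):
            covered.add(x)
    return len(covered)


def n_grad(a_facts, c_facts):
    # Gradual rule: the conclusion is more satisfied than the condition.
    return count_covered(a_facts, c_facts, lambda mu_a, mu_c: mu_c > mu_a)


def n_cert2(a_facts, c_facts):
    # Type 2 certainty rule: mu_C exceeds 1 - mu_A.
    return count_covered(a_facts, c_facts, lambda mu_a, mu_c: mu_c > 1 - mu_a)


print(n_grad(A, C), n_cert2(A, C))
```

On this data the sketch prints `1 2`, in agreement with n_grad = 1 and n_cert2 = 2 above.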



Thus, we can use the FOIL algorithm for inducing various kinds of first-order
fuzzy rules by adapting the confidence degree and the cardinality to the type of
rules that we want to learn.



5 Illustrative Example 

As an illustration of our approach, we have explored a database that can be
found on the PRETI platform (http://www.irit.fr/PRETI/). This database
describes houses to let for vacations. There are more than 600 houses described
in terms of about 25 attributes. Many of these attributes are about distances
between the house and another place (sea, fishing place, swimming pool, ...)
or about prices at different periods of the year (June, weekend, school
vacations, ...). Typically, these attributes have a fuzzy interpretation. So,
from our database, we build a fuzzy database by merely changing price, number of
rooms, and distance information into fuzzy information such as “cheap”,
“expensive”, “small_capacity”, “high_capacity” (these two latter expressions
refer to the size of the house), “far”, “not_too_far”, “not_far”, together with
a membership degree. For instance, “cheap” and “expensive” are represented by
the trapezoids (0, 0, 800, 2000) and (2500, 3800, 10000, 10000). “close_to”,
“not_too_far” and “far” are represented by the trapezoids (0, 0, 5, 10),
(5, 10, 30, 35) and (30, 35, 100, 100) respectively. For discrete attributes,
membership degrees have to be directly associated with each possible value.
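
The fuzzification step can be sketched as follows. This is only an illustration under stated assumptions: each 4-tuple (a, b, c, d) is read as a trapezoid with support [a, d] and core [b, c], distances are taken to be in km, and the helper name `trapezoid` and the sample inputs are hypothetical.

```python
def trapezoid(a, b, c, d):
    """Return the membership function of the trapezoid (a, b, c, d).

    The degree is 0 outside [a, d], 1 on the core [b, c], and linear on
    the two slopes. Degenerate sides (a == b or c == d) are handled by
    testing the core before the slopes.
    """
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)  # rising slope, here b > a
        return (d - x) / (d - c)      # falling slope, here d > c
    return mu


# Trapezoids quoted in the text.
cheap = trapezoid(0, 0, 800, 2000)
expensive = trapezoid(2500, 3800, 10000, 10000)
close_to = trapezoid(0, 0, 5, 10)
not_too_far = trapezoid(5, 10, 30, 35)
far = trapezoid(30, 35, 100, 100)

# Hypothetical inputs: a price of 1400 and a distance of 7.5 km.
print(cheap(1400), expensive(1400))
print(close_to(7.5), not_too_far(7.5), far(7.5))
```

A single crisp value may belong to two neighbouring fuzzy categories at once, which is why one distance fact can yield several fuzzy facts with different satisfaction degrees.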

For example, the fact that the house x_1 lies at 5.5 km from the sea, represented
by the logical atomic formula distance(x_1, sea, 5.5), becomes the fuzzy predicate
close_to(x_1, sea) with 0.8 as satisfaction degree and not_too_far(x_1, sea) with
0.2 as satisfaction degree. The tables below present some examples of first-order
rules that have been found by the algorithm, for each type of fuzzy rule.



Flexible Rules                                                                     Confidence
expensive(A, B), dishwasher(A), small_capacity(A) → comfortable(A)                 0.91
very_comfortable(A), expensive(A, B) → close_to(A, sea)                            0.86
not_comfortable(A), far(A, B) → cheap(A, September)                                0.56
expensive(A, C), B = shop, small_capacity(A) → close_to(A, B)                      0.90

Gradual Rules                                                                      Confidence
expensive(A, B), washingmachine(A) → comfortable(A)                                0.96
expensive(A, B) → comfortable(A)                                                   0.90
area(A, NARBONNAIS), high_capacity(A) → close_to(A, sea)                           1
not_comfortable(A), far(A, B), small_capacity(A) → cheap(A, September)             0.88

Type 1 Certainty Rules                                                             Confidence
area(A, LAURAGAIS), expensive(A, C) → comfortable(A)                               0.92
very_comfortable(A) → close_to(A, sea)                                             0.83
area(A, LIMOUXIN) → cheap(A, September)                                            0.86
expensive(A, B), phone(A), small_capacity(A) → television(A)                       0.91

Type 2 Certainty Rules                                                             Confidence
dishwasher(A), area(A, B), television(A) → comfortable(A)                          0.93
area(A, NARBONNAIS), high_capacity(A) → close_to(A, sea)                           0.90
number_of_chamber(A, 1), pet_accepted(A), far(A, sea) → not_comfortable(A)         0.93
high_capacity(A), area(A, NARBONNAIS), phone(A) → very_comfortable(A)              1



Note that some rules could be expressed in propositional logic, but here the
instantiations are automatically generated by the algorithm. As shown in some
rules, the algorithm can mix fuzzy predicates and non-fuzzy predicates.

Since flexible rules have the same meaning as crisp ones, but just privilege the
high α-cuts for the instantiation of the rule, the rules found remain quite
similar to the ones that could be found in the non-fuzzy approach. For the
gradual rules, what is learned is an original kind of rule. For example, the
first rule in the gradual rules table means: “the more there exists a time
period when the house is really expensive, and if the house has a washing
machine, the more the house is comfortable”. The fuzzy rules that handle
certainty tend to favour the non-fuzzy predicates in the condition part because
they leave more freedom with respect to the satisfaction degree of covered
examples.

6 Conclusion

This paper has described a formal framework and a procedure for learning
various kinds of first-order rules involving fuzzy (or non-fuzzy) predicates. The
definition of confidence degrees for rules with fuzzy predicates allows us to
easily introduce them in any learning algorithm which uses confidence degrees as
a basis for the guiding process. Since the confidence computation is a weighted
version of FOIL’s, the complexity of the algorithm is the same as that of FOIL.
Learning rules with fuzzy predicates has some obvious advantages. A first one is
that fuzzy rules can involve fuzzy categories as often used by people. Generally
speaking, it is well known that fuzzy sets defined on subsets of the real line
provide a flexible interface with precise numerical values. The use of fuzzy
predicates also contributes to reducing the hypothesis space. Moreover, since it
is also known that ILP has difficulties in handling real numbers, the use of
fuzzy predicates for representing numerically valued predicates (as done with
the PRETI database) can provide a valuable improvement. The different kinds of
rules that can be learned increase the expressivity. Moreover, the paper
proposes a method for learning gradual rules, a topic which has not been much
considered until now.

The definition of confidence degrees for each kind of rule allows us to take
into account the fuzzy predicates in the algorithms that use confidence degrees
for guiding the learning process. But we also see that trying to learn rules
which do not cover any counter-example, as in classical ILP, is not sufficient in
the case of fuzzy predicates. This is why we need to formally describe the ILP
setting for rules with fuzzy predicates by incorporating confidence degrees.
However, the proposed algorithm could still be improved. First, as in data
mining, other degrees like support could be computed and used for controlling
the learning process. Besides, our algorithm has the same limitations as FOIL
and cannot find all the “interesting” rules ([18]). Building an extraction
algorithm that by-passes classical ILP techniques remains an open problem.
Finally, another line of research is to study the prediction capabilities of the
induced rules.

Abstract. We study knowledge-based systems using symbolic many-
valued logic and multiset theory. In previous papers we have proposed a
symbolic representation of nuanced statements like “John is very tall”.
In this representation, we have interpreted some nuances of natural
language as linguistic modifiers and we have defined them within a
multiset context. In this paper, we continue the presentation of our
symbolic model and we propose new deduction rules dealing with
nuanced statements. More precisely, we present new Generalized Modus
Ponens rules and we study the form of graduality verified by these rules.

Keywords: knowledge representation and reasoning, imprecision, 
vagueness, many-valued logic, multiset theory 



1 Introduction 

The development of knowledge-based systems is a rapidly expanding field in
applied artificial intelligence. The knowledge base is comprised of a database and
a rule base. We suppose that the database contains facts representing nuanced
statements, like “Jo is very tall”, to which one associates truth degrees. The
nuanced statements can be represented more formally under the form “x is m_α
A” where m_α and A are labels denoting respectively a nuance and a vague or
imprecise term of natural language. The rule base contains rules of the form “if
x is m_α A then y is m_β B” to which one associates truth degrees.

Our work presents a symbolic-based model which permits a qualitative man- 
agement of vagueness in knowledge-based systems. In dealing with vagueness, 
there are two issues of importance: (1) how to represent vague data, and (2) how 
to draw inference using vague data. 

When imprecise information is evaluated in a numerical way, fuzzy logic,
which was introduced by Zadeh [10], is recognized as a good tool for dealing with
the aforementioned issues and for performing reasoning upon common sense and
vague knowledge bases. In this logic, “x is m_α A” is considered as a fuzzy
proposition where A is modeled by a fuzzy set which is defined by a membership
function. The latter is generally defined upon a numerical scale. The nuance m_α
is defined as a fuzzy modifier [2,9,10] which produces, from the fuzzy set A, a
new fuzzy set “m_α A”. So, “x is m_α A” is interpreted by Zadeh as “x is (m_α A)”
and is regarded as a many-valued statement. A second formalism, which refers to
a symbolic many-valued logic [7,9], is used when imprecise information is
evaluated in a symbolic way. This logic is the logical counterpart of a multiset
theory introduced by De Glas [7]. In this theory, the term m_α linguistically
expresses the degree to which the object x satisfies the term A. So, “x is m_α A”
means “x (is m_α) A”, and is then regarded as a boolean statement. In other
words, “m_α A” does not represent a new vague term obtained from A.

In previous papers [5,6], we have proposed a symbolic-based model to represent
nuanced statements. This model is based on the many-valued logic proposed
by Pacholczyk [9]. Our basic idea has been to consider that some nuances of
natural language cannot be interpreted as satisfaction degrees and must instead
be defined as linguistic modifiers. In Sect. 3, we present a short review of this
model. In this paper, our basic contribution is to propose deduction rules
dealing with nuanced information. For that purpose, we propose deduction rules
generalizing the Modus Ponens rule in a many-valued logic proposed by Pacholczyk
[9]. We notice that the first version of this rule was proposed in a fuzzy
context by Zadeh [10] and was studied later by various authors [1,2,8].

In Sect. 2, we briefly present the basic concepts of the M-valued predicate
logic which forms the backbone of our work. Section 3 briefly introduces the
symbolic representation model previously proposed. In Sect. 4, we study various
types of inference rules and we propose new Generalized Modus Ponens rules
in which we use only simple statements. In Sect. 5, we propose a generalized
production system in which we define more Generalized Modus Ponens rules
in more complex situations. In Sect. 6, we study the problem of graduality of
inference and we demonstrate the forms of graduality satisfied by our GMP rules.

2 M-Valued Predicate Logic

Within a multiset context, to a vague term A and a nuance m_α are associated
respectively a multiset A and a symbolic degree τ_α. So, the statement “x is m_α
A” means that x belongs to the multiset A with a degree τ_α. The M-valued
predicate logic [9] is the logical counterpart of the multiset theory. In this
logic, to each multiset A and membership degree τ_α are associated an M-valued
predicate A and a truth degree τ_α-true. In this context, the following
equivalence holds: x is m_α A ⇔ x ∈_α A ⇔ “x is m_α A” is true ⇔ “x is A” is
τ_α-true. One supposes that the membership degrees are symbolic degrees which
form an ordered set L_M = {τ_α, α ∈ [1, M]}. We can then define in L_M two
operators ∧ and ∨ and a decreasing involution ∼ as follows:
τ_α ∨ τ_β = τ_{max(α,β)}, τ_α ∧ τ_β = τ_{min(α,β)} and ∼τ_α = τ_{M+1−α}.
One then obtains a chain {L_M, ∨, ∧, ≤} having the structure
of a De Morgan lattice [9]. On this set, an implication → and a T-norm T can
be defined respectively as follows: τ_α → τ_β = τ_{min(β−α+M, M)} and
T(τ_α, τ_β) = τ_{max(α+β−M, 1)}.
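
Since every symbolic degree τ_α is identified by its index α, the operators can be sketched as plain integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, M is fixed to 9 with the nine-label scale of Example 1, and the T-norm is taken to be the Łukasiewicz one, max(α+β−M, 1), which is an assumption of this sketch.

```python
M = 9  # size of the scale; 9 matches the scale of Example 1
LABELS = ["not at all", "little", "enough", "fairly", "moderately",
          "quite", "almost", "nearly", "completely"]


def t_or(a: int, b: int) -> int:
    """Index of tau_a OR tau_b."""
    return max(a, b)


def t_and(a: int, b: int) -> int:
    """Index of tau_a AND tau_b."""
    return min(a, b)


def t_not(a: int) -> int:
    """Index of the involution of tau_a."""
    return M + 1 - a


def implies(a: int, b: int) -> int:
    """Index of tau_a -> tau_b."""
    return min(b - a + M, M)


def t_norm(a: int, b: int) -> int:
    """Index of T(tau_a, tau_b), Lukasiewicz form (assumed)."""
    return max(a + b - M, 1)


def label(a: int) -> str:
    """Linguistic label of tau_a (indices start at 1)."""
    return LABELS[a - 1]


print(label(implies(8, 6)))  # nearly -> quite
print(label(t_norm(6, 7)))   # T(quite, almost)
print(label(t_not(2)))       # involution of little
```

With these definitions the three calls print `almost`, `fairly` and `nearly`.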

Example 1. For example, by choosing M = 9, we can introduce: L_9 = {not at all,
little, enough, fairly, moderately, quite, almost, nearly, completely}.

3 Representation of Nuanced Statements

Let us suppose that our knowledge base is characterized by a finite number
of concepts C_i. A set of terms P_ik is associated with each concept C_i, whose
respective domain is denoted as X_i. The terms P_ik are said to be the basic terms
connected with the concept C_i. A finite set of linguistic modifiers allows us
to define nuanced terms, denoted as “m_α P_ik”.

In previous papers [5,6], we have proposed a symbolic-based model to represent
nuanced statements of natural language. We first proposed a new
method to symbolically represent vague terms. In this method, we suppose that
the domain of a vague term, denoted by X, is simulated by a “rule” (cf. Fig. 1)
representing an arbitrary set of objects.

Our basic idea has been to associate with each multiset P_i a symbolic concept
which plays the role of the membership function in fuzzy set theory.
For that, we have introduced a new concept, called “rule”, which has a geometry
similar to a membership L-R function and whose role is to illustrate the
membership graduality to the multisets. In order to define the geometry of this
“rule”, we use notions similar to those defined within a fuzzy context like the
core, the support and the fuzzy part of a fuzzy set [10]. We define these notions
within a multiset theory as follows: the core of a multiset P_i, denoted as
Core(P_i), represents the elements belonging to P_i with a τ_M degree; the
support, denoted as Sp(P_i), contains the elements belonging to P_i with at least
a τ_2 degree; and the fuzzy part, denoted as F(P_i), contains the elements
belonging to P_i with degrees varying from τ_2 to τ_{M−1}. We associate with each
multiset a “rule” that contains the elements of its support (cf. Fig. 3). This
“rule” is the union of three disjoint subsets: the left fuzzy part, the right
fuzzy part and the core. For a multiset P_i, they are denoted respectively by
L_i, R_i and C_i.

We suppose that the left (resp. right) fuzzy part L_i (resp. R_i) is the union
of M−2 subsets, denoted as [L_i]_α (resp. [R_i]_α), which partition it. [L_i]_α
(resp. [R_i]_α) contains the elements of L_i (resp. R_i) belonging to P_i with a
τ_α degree. In order to keep a similarity with the fuzzy sets of type L-R, we
choose to place, in a “rule” associated with a multiset, the subsets [L_i]_α and
[R_i]_α so that the larger α is, the closer the subsets [L_i]_α and [R_i]_α are
to the core C_i (cf. Fig. 3). That can be interpreted as follows: the elements of
the core of a term represent the typical elements of this term, and the more an
object moves away from the core, the less it satisfies the term. Finally, we have
denoted a multiset P_i with which we associate a “rule” as P_i = (L_i, C_i, R_i),
and we have introduced symbolic parameters which enable us to describe the form
of the “rule” and its position in the universe X. These parameters have a role
similar to the role of the numerical parameters which are used to define a fuzzy
set within a fuzzy context.

3.1 Linguistic Modifiers 

By using the “rule” concept we have defined the linguistic modifiers. We have
used two types of linguistic modifiers.

- Precision Modifiers: The precision modifiers increase or decrease the
precision of the basic term. We distinguish two types of precision
modifiers: contraction modifiers and dilation modifiers. We use M_6 = {m_k | k ∈
[1..6]} = {exactly, really, ∅, more or less, approximately, vaguely} which is
totally ordered by j ≤ k ⇔ m_j ≤ m_k (Fig. 4).

- Translation Modifiers: The translation modifiers operate both a translation
and a precision variation (contraction or dilation) on the basic term. We use
T_9 = {t_k | k ∈ [1..9]} = {extremely little, very very little, very little, rather
little, ∅, rather, very, very very, extremely} totally ordered by k ≤ l ⇔ t_k ≤
t_l (Fig. 5). The multisets t_k P_i cover the domain X.

In this paper, we continue to develop our model for managing nuanced
statements. In the following, we focus our attention on the problem of the
exploitation of nuanced statements.

4 Exploitation of Nuanced Statements 

In this section, we treat the exploitation of nuanced information. In particular,
we are interested in proposing some generalizations of the Modus Ponens rule





386 M. El-Sayed and D. Pacholczyk 



within a many-valued context [9]. Recall that the classical Modus Ponens
rule has the following form: if we know that {If “x is A” then “y is B” is true
and “x is A” is true}, we conclude that “y is B” is true. Within a many-valued
context, a generalization of the Modus Ponens rule has one of the following forms:

F1- If we know that {If “x is A” then “y is B” is τ_β-true and “x is A′” is
τ_ε-true} and that {A′ is more or less near to A}, what can we conclude for
“y is B”; in other words, to what degree is “y is B” true?

F2- If we know that {If “x is A” then “y is B” is τ_β-true and “x is A′” is
τ_ε-true} and that {A′ is more or less near to A}, can we find a B′ such that
{B′ is more or less near to B}, and to what degree is “y is B′” true?

These forms of the Generalized Modus Ponens (GMP) rule were first studied
by Pacholczyk in [9]. In this section, we propose new versions of the GMP rule in
which we use new relations of nearness.

4.1 First GMP Rule 

In Pacholczyk’s versions of GMP, the concept of nearness binding multisets A
and A′ is modeled by a similarity relation which is defined as follows:

Definition 1. Let A and B be two multisets. A is said to be τ_α-similar to B,
denoted as A ≈_α B, if and only if: ∀x | x ∈_γ A and x ∈_β B ⇒
min{τ_γ → τ_β, τ_β → τ_γ} ≥ τ_α.

This relation generalizes the equivalence relation in a many-valued context as the
similarity relation of Zadeh [10] does in a fuzzy context. It is (1) reflexive:
A ≈_M A, (2) symmetrical: A ≈_α B ⇔ B ≈_α A, and (3) weakly transitive:
{A ≈_α B, B ≈_β C} ⇒ A ≈_γ C with τ_γ ≥ T(τ_α, τ_β), where T is a T-norm.
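
A minimal sketch of the similarity relation: for finite multisets, the largest α such that A is τ_α-similar to B is the minimum, over all elements, of the two-way implication between their membership degrees. The encoding is an assumption of this sketch: a multiset is a dictionary from elements to degree indices in 1..M, elements that are not listed get the lowest degree τ_1, M is fixed to 9, and the sample multisets are hypothetical.

```python
M = 9  # size of the scale of symbolic degrees (assumed here)


def implies(a: int, b: int) -> int:
    """Index of tau_a -> tau_b."""
    return min(b - a + M, M)


def similarity_degree(ms_a: dict, ms_b: dict) -> int:
    """Largest alpha such that ms_a is tau_alpha-similar to ms_b."""
    degrees = []
    for x in set(ms_a) | set(ms_b):
        g = ms_a.get(x, 1)  # degree of x in the first multiset
        b = ms_b.get(x, 1)  # degree of x in the second multiset
        degrees.append(min(implies(g, b), implies(b, g)))
    return min(degrees) if degrees else M


# Hypothetical multisets over three people.
tall = {"x1": 9, "x2": 7, "x3": 4}
really_tall = {"x1": 9, "x2": 6, "x3": 3}

print(similarity_degree(tall, really_tall))
print(similarity_degree(tall, tall))
```

On this data the degree is 8 in both directions, and a multiset compared with itself gives M, in line with the symmetry and reflexivity properties stated above.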

By using the similarity relation to model the nearness binding between
multisets, the inference rule can be interpreted as: {the more the rule and the
fact are true} and {the more A and A′ are similar}, the more the conclusion is
true. In particular, when A′ is more precise than A (A′ ⊂ A) but they are very
weakly similar, any conclusion can be deduced or the conclusion deduced isn’t as
precise as one can expect. This is due to the fact that the similarity relation
alone isn’t able to model in a satisfactory way the nearness between A′ and A.
For that reason, we add to the similarity relation a new relation called
nearness relation whose role is to define the nearness of A′ to A when A′ ⊂ A.
In other words, it indicates the degree to which A′ is included in A.

Definition 2. Let A and B be two multisets such that A ⊂ B. A is said to be
τ_α-near to B, denoted as A ⊏_α B, if and only if {∀x ∈ F(B), x ∈_β A and x ∈_γ
B ⇒ τ_α → τ_β ≤ τ_γ}.

The nearness relation satisfies the following properties: (1) Reflexivity: A ⊏_M
A, and (2) Weak transitivity: A ⊏_α B and B ⊏_β C ⇒ A ⊏_γ C with τ_γ ≤
min(τ_α, τ_β). In the relation A ⊏_α B, the smaller the value of α is, the more A
is included in B. Finally, by using the similarity and nearness relations, we
propose a first Generalized Modus Ponens rule.

Proposition 1. Let A and A′ be predicates associated with the concept C_i, and B
be a predicate associated with the concept C_e. Given the following assumptions:

1. it is τ_β-true that if “x is A” then “y is B”

2. “x is A′” is τ_ε-true with A′ ≈_α A.

Then, we conclude: “y is B” is τ_δ-true with τ_δ = T(τ_β, T(τ_α, τ_ε)). If A′ is
such that A′ ⊏_α′ A, we conclude: “y is B” is τ_δ-true with τ_δ = T(τ_β, τ_α′ → τ_ε).

Example 2. Given that “really tall” ≈ “tall” and “really tall” ⊏_8 “tall”, from
the following rule and fact:

- if “x is tall” then “its weight is important” is true

- “Pascal is really tall” is quite-true,

we can deduce: “Pascal’s weight is really important” is almost-true.
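
The truth degree obtained in this example can be checked with a short sketch of the two formulas of Proposition 1. Several points are assumptions of the sketch rather than statements of the paper: degrees are handled through their indices on the nine-label scale, a rule that “is true” is read as τ_9-true, the nearness degree of “really tall” to “tall” is taken to be 8, the T-norm is the Łukasiewicz one, and the function names are hypothetical.

```python
M = 9
LABELS = ["not at all", "little", "enough", "fairly", "moderately",
          "quite", "almost", "nearly", "completely"]


def implies(a: int, b: int) -> int:
    """Index of tau_a -> tau_b."""
    return min(b - a + M, M)


def t_norm(a: int, b: int) -> int:
    """Index of T(tau_a, tau_b), Lukasiewicz form (assumed)."""
    return max(a + b - M, 1)


def gmp_similar(beta: int, alpha: int, eps: int) -> int:
    """Proposition 1 when the fact predicate is tau_alpha-similar."""
    return t_norm(beta, t_norm(alpha, eps))


def gmp_near(beta: int, alpha_near: int, eps: int) -> int:
    """Proposition 1 when the fact predicate is tau_alpha'-near."""
    return t_norm(beta, implies(alpha_near, eps))


rule_degree = M                          # the rule "is true"
fact_degree = LABELS.index("quite") + 1  # "Pascal is really tall" is quite-true

delta = gmp_near(rule_degree, 8, fact_degree)
print(LABELS[delta - 1])
```

Under these assumptions the nearness variant yields `almost`, which is the degree stated in the example.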



4.2 GMP Rules Using Precision Modifiers 

In the previous paragraph we calculated the degree to which the conclusion of the
rule is true. In the following, we present two new versions of the GMP rule in
which the predicate of the conclusion obtained by the deduction process is not B
but a new predicate B′ which is more or less near to B. More precisely, the new
predicate is derived from B by using precision modifiers¹ (B′ = mB). The first
version assumes that the predicates A and A′ are more or less similar. In other
words, A′ may be less precise or more precise than A. The second one assumes
that A′ is more precise than A.

Proposition 2. Given the following assumptions:

1. it is τ_β-true that if “x is A” then “y is B”

2. “x is A′” is τ_ε-true with A′ ≈_α A.

Let τ_θ = T(τ_β, T(τ_α, τ_ε)). If τ_θ > τ_1 then there exists a τ_{n(δ)}-dilation
modifier m, with τ_δ ≤ T(τ_α, τ_β), such that: “y is mB” is τ_ε′-true and
τ_ε′ = τ_δ → τ_θ. Moreover, we have: B ⊂ mB and mB ≈_δ B.

This proposition proves that if we know that A′ is more or less similar to A,
without any supplementary information concerning its precision compared to A,
the predicate of the conclusion obtained by the deduction process (mB) is less
precise than B (i.e. B ⊂ mB) and is more or less similar to B. In the
following proposition, we assume that A′ is more precise than A.

Proposition 3. Given the following assumptions:

1. it is τ_β-true that if “x is A” then “y is B”

2. “x is A′” is τ_ε-true with A′ ⊏_α A.

Let τ_θ = T(τ_β, τ_α → τ_ε). If τ_θ > τ_1 then there exists a τ_{n(δ)}-contraction
modifier m, with τ_δ ≥ τ_β → τ_α, such that: “y is mB” is τ_ε′-true and
τ_ε′ = T(τ_δ, τ_θ).

Moreover, we have: mB ⊏_δ B.

¹ The definitions of these are presented in appendix A.

This proposition proves that from a predicate A′ which is more or less near to
A we obtain a predicate mB which is more or less near to B. More precisely,
if A′ is more precise than A then mB is more precise than B. The previous
propositions (2 and 3) present two general cases in which we consider arbitrary
predicates A′. In the following, we present two corollaries representing special
cases of propositions 2 and 3 in which we assume that the rule is completely true
and that A′ is obtained from A by using precision modifiers.

Corollary 1. Let the following rule and fact:

1. it is true that if “x is A” then “y is B”

2. “x is m_k A” is τ_ε-true where m_k is a τ_{γ_k}-dilation modifier.

If T(∼τ_{γ_k}, τ_ε) > τ_1 then we conclude: “y is m_k B” is τ_ε′-true, with
τ_ε′ = ∼τ_{γ_k} → T(∼τ_{γ_k}, τ_ε).

Example 3. Given the following data:

- if “x is tall” then “its weight is important” is true,

- “Jo is more or less tall” is moderately-true.

Then, we can deduce: “Jo’s weight is more or less important” is moderately-true.

Corollary 2. Let the following rule and fact:

1. it is true that if “x is A” then “y is B”

2. “x is m_k A” is τ_ε-true where m_k is a τ_{γ_k}-contraction modifier.

Then, we conclude that: “y is m_k B” is τ_ε-true.

Example 4. Given the following data:

- if “x is tall” then “its weight is important” is true,

- “Pascal is really tall” is moderately-true.

Then, we can deduce: “Pascal’s weight is really important” is moderately-true.

These two corollaries present a particular form of graduality of inference. This 
form is known as graduality by means of linguistic modifiers [3]. It enables us 
to obtain, from a fact whose predicate A' is nuanced by linguistic modifiers, a 
conclusion whose predicate is also nuanced by linguistic modifiers. 

4.3 Other Inference Rules 

In the previous paragraphs, we presented GMP rules in which we can either (1)
calculate the degree to which the conclusion of the rule is true, or (2) obtain a
new conclusion which is more or less near to that of the rule and calculate the
degree to which it is true. In this paragraph, we present new inference rules in
which the predicate of the conclusion obtained by the deduction process is a new
predicate B′ which is more or less near to B. The new predicate is chosen from
the set of nuanced predicates associated with a concept. The existence of such
a predicate is not guaranteed and depends on the predicate B and on the
other predicates associated with the same concept. We notice that these forms
of inference rules can be used to evaluate the truth of a statement by using rules
and facts which are available in the knowledge-based system. In other words, if
we know that {If “x is A” then “y is B” is τ_β-true and “x is A′” is τ_ε-true}
and that {A′ and B′ are respectively more or less near to A and B}, can we
calculate the degree to which “y is B′” is true?

We present below two inference rules. The first version assumes that the
predicates A and A′ are more or less similar. The second one assumes that A′
is more precise than A.

Proposition 4. Given the following assumptions:

1. it is τ_β-true that if “x is A” then “y is B”

2. “x is A′” is τ_ε-true with A′ ≈_α A.

Let Te = r(r/ 3 , r(Ta, Te)). If Tg > Ti and if there exists a predicate B such that 
B B, then we can conclude that: ”y is B ” is r^'-true and = T(Ts,Tg). If 
the predicate B is such that B \Zs' B , we conclude: ”y is B ” is r^i-true with 

T^' = TS' Tg . 

Proposition 5. Given the following assumptions:

1. it is τ_β-true that if “x is A” then “y is B”

2. “x is A′” is τ_ε-true with A′ ⊏_α A.

Let Tg = T{Tp,Ta — > Te). If Tg > T\ and if there exists a predicate B such that 
B Kig B, then we can conclude that: ”y is B ” is Te'-true and Te> = T(Ts,Tg). If 
the predicate B is such that B IZi' B , we conclude: ”y is B ” is Te'-true with 

Te’ = TS’ Tg . 

Example 5. Let ’’very important” Ce ’’important”, ’’really tall” ’’tall” and 
’’really tall” Cg ’’tall”. Let our knowledge-based system contain: 

- if “x is tall” then “its weight is important” is true,

- “Jo is really tall” is quite-true.

Then, we want to know the truth degree of the statement “Jo’s weight is very
important”. By applying proposition 5, we can deduce:

“Jo’s weight is very important” is fairly-true.

5 Generalized Production System 

In this section, we present some Generalized Modus Ponens rules in more
complex situations. More precisely, we study the reasoning in 4 situations:

1. When the antecedent of the rule is a conjunction of statements.

2. When the antecedent is a disjunction of statements.

3. In presence of propagation of inferences. In other words, when the conclusion
of the first rule is the antecedent of the second rule, and so on.

4. When a combination of imprecision is possible. In other words, when we have
some rules which have the same statement in their conclusion parts.

So, we present the following 4 propositions representing inference rules in these
situations.

Proposition 6 (Antecedent is a conjunction). Given the following assumptions:

1. if “x_1 is A_1” and ... and “x_n is A_n” then “y is B” is τ_β-true,

2. for i = 1..n, “x_i is A′_i” is τ_{ε_i}-true,

3. for i = 1..n, A_i ≈_{α_i} A′_i.

Then, we can deduce: “y is B” is τ_δ-true with τ_δ = T(τ_β, T(τ_{α_1}, τ_{ε_1})) ∧
... ∧ T(τ_β, T(τ_{α_n}, τ_{ε_n})). If, for i = j..k, the predicates A′_i are such
that A′_i ⊏_{α′_i} A_i, we can deduce: “y is B” is τ_δ-true with τ_δ = τ_{δ_1} ∧
... ∧ τ_{δ_n} and τ_{δ_i} = T(τ_{α′_i} → τ_{ε_i}, τ_β) if i ∈ [j,k] and
τ_{δ_i} = T(τ_β, T(τ_{α_i}, τ_{ε_i})) if not.

Proposition 7 (Antecedent is a disjunction). Given the following assumptions:

1. if “x_1 is A_1” or ... or “x_n is A_n” then “y is B” is τ_β-true,

2. for i = 1..k, “x_i is A′_i” is τ_{ε_i}-true,

3. for i = 1..k, A_i ≈_{α_i} A′_i.

Then, we can deduce: “y is B” is τ_δ-true with τ_δ = T(τ_β, T(τ_{α_1}, τ_{ε_1})) ∨
... ∨ T(τ_β, T(τ_{α_k}, τ_{ε_k})). If, for i = j..l, the predicates A′_i are such
that A′_i ⊏_{α′_i} A_i, we can deduce: “y is B” is τ_δ-true with τ_δ = τ_{δ_1} ∨
... ∨ τ_{δ_k} and τ_{δ_i} = T(τ_{α′_i} → τ_{ε_i}, τ_β) if i ∈ [j,l] and
τ_{δ_i} = T(τ_β, T(τ_{α_i}, τ_{ε_i})) if not.

Proposition 8 (Propagation of inference). Given the following assumptions:

1. if “x is A” then “y is B” is τ_β-true,

2. if “y is B” then “z is C” is τ_γ-true,

3. there exists τ_ε > τ_1 such that “x is A′” is τ_ε-true,

4. there exists τ_α such that A′ ≈_α A.

Then, we can deduce: “z is C” is τ_δ-true, with τ_δ = T(T(τ_β, τ_γ), T(τ_α, τ_ε)).
If the predicate A′ is such that A′ ⊏_α′ A, then we can deduce: “z is C” is
τ_δ-true, with τ_δ = T(T(τ_β, τ_γ), τ_α′ → τ_ε).

Proposition 9 (Combination of imprecision). Given the following assumptions:

1. for i = 1..n, if “x_i is A_i” then “y is B” is τ_{β_i}-true,

2. for i = 1..n, “x_i is A′_i” is τ_{ε_i}-true,

3. for i = 1..n, A_i ≈_{α_i} A′_i,

then we can deduce that: “y is B” is τ_δ-true with τ_δ = T(τ_{β_1}, T(τ_{α_1},
τ_{ε_1})) ∨ ... ∨ T(τ_{β_n}, T(τ_{α_n}, τ_{ε_n})). If, for i = j..k, the predicates
A′_i are such that A′_i ⊏_{α′_i} A_i, then we can deduce: “y is B” is τ_δ-true with
τ_δ = τ_{δ_1} ∨ ... ∨ τ_{δ_n} and τ_{δ_i} = T(τ_{α′_i} → τ_{ε_i}, τ_{β_i}) if
i ∈ [j,k] and τ_{δ_i} = T(τ_{β_i}, T(τ_{α_i}, τ_{ε_i})) if not.

We present below an example in which we use the GMP rules presented in
this section. In this example, we use index cards written by a doctor after his
consultations. From index cards (F_i) and some rules (R_j), we wish to deduce a
diagnosis.

Example 6. Let us assume that we have the following rules in our rule base.

R_1 - “If the temperature is high, the patient is ill” is almost true,

R_2 - “If the tension is always high, the patient is ill” is nearly true,

R_3 - “If the temperature is high and the eardrum color is very red, the disease
is an otitis” is true,

R_4 - “If fat eating is high, the cholesterol risk is high” is true,

R_5 - “If the cholesterol risk is high, a diet with no fat is recommended” is true.

Let us now assume that we have an index card for a patient and we want to
deduce a diagnosis.

F_1 - “the temperature is rather high” is nearly true,

F_2 - “the tension is always more or less high” is almost true,

F_3 - “the eardrum color is really very red” is quite true,

F_4 - “the fat eating is very very high” is moderately true.

Using the GMP rules previously presented, we deduce the following diagnosis:

D_1 - “the patient is ill” is almost true,

D_2 - “the disease is an otitis” is almost true,

D_3 - “the cholesterol risk is high” is true,

D_4 - “a diet with no fat is recommended” is true.

Let us assume that we have the following relations: ’’rather high” IZ7 ’’high”, 
’’more or less high” ’’high”, ’’really very red” Cg ’’very red” and ’’very very 
high” IZ2 ’’high”. Then, the diagnosis {T>i - V4) are obtained as follows. 

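- $\mathcal{D}_1$ is obtained by applying proposition 9 to ($\mathcal{F}_1$, $\mathcal{F}_2$) and ($\mathcal{R}_1$, $\mathcal{R}_2$),

- $\mathcal{D}_2$ is obtained by applying proposition 6 to ($\mathcal{F}_1$, $\mathcal{F}_3$) and $\mathcal{R}_3$,

- $\mathcal{D}_3$ is obtained by applying proposition 1 to $\mathcal{F}_4$ and $\mathcal{R}_4$,

- $\mathcal{D}_4$ is obtained by applying proposition 8 to $\mathcal{F}_4$ and ($\mathcal{R}_4$, $\mathcal{R}_5$).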
6 Graduality of Inference

In this section, we investigate the graduality of inference satisfied by our
GMP rules. We distinguish mainly five forms of gradual rule, involving
graduality: (1) on the truth value, (2) on inclusion between multisets, (3) by
means of linguistic modifiers, (4) on the degree of similarity or the
proximity of multisets, (5) on terms associated with a concept. In
this section, we restrict our study to the graduality based on the truth value and
on inclusion between multisets. Other forms of graduality will be investigated in
future papers.

In the following, we study the graduality satisfied by the GMP rules proposed
in paragraphs 4.1 and 4.2. The following proposition establishes the form of
graduality underlying the first GMP rule (proposition 1).

Proposition 10. Given the following assumptions: 

1. it is Tp-true that if ”x is A” then ”y is B” 

2. ”x is A ” is Te^-true, 

then, we can conclude: ”u is B” is t> -true. 

If we have the fact ”x is A ” is -true such that {A C A and }, we 

can conclude: ”u is B ” is tj -true with t> > tj . 

€2 ^2 

In other words, the GMP rule in the proposition 1 satisfies the following gradual 
rule: the more (the less) the proposition ”x is A ” is true and the more (the less) 
A is included in A {A C A), the more (the less) the proposition ”y is B” is 
true. 

The following proposition establishes the graduality underlying the GMP rules
in propositions 2 and 3.

Proposition 11. Given the following assumptions: 

1. it is Tp-true that if ”x is A" then ”y is B” 

2. ”x is A ” is T^^-true, 

then, there exists a precision modifier mi such that: ”y is miB” is t> -true. 

^1 

If we have the fact ”x is A ” is Te^-true such that {A C A and > Tei}, 

then we can find a precision modifier m 2 such that: ”y is m 2 B” is tj -true, with 

\m 2 B C m\B and t> > tj |. 

^2 ^1 

This proposition is valid if either A d A C^or.4 C A C A. In the first 
case, we have the following form of gradual rule which is verified by the GMP 
rule in proposition 2: the less the proposition ”x is A ” is true and the less A is 
included in A (i.e. the more A is included in A ), the less the proposition ”y is 
mB” is true and the less mB is included in B. 

In the case of A d A d A, we have the following form of gradual rule which 
is verified by the GMP rule in proposition 3: the more the proposition ”x is A ” 
is true and the more A is included in A, the more the proposition ”y is mB” is 
true and the more mB is included in B. 
7 Conclusion

In this paper, we have proposed a symbolic model dealing with nuanced
information. This model is inspired by the representation methods of fuzzy
logic. In previous papers, we proposed a new representation method for
nuanced statements. In this paper, we proposed some deduction rules dealing
with nuanced statements and we presented new Generalized Modus Ponens rules.
These rules can be used with either simple statements or complex statements. Finally,
we have studied the problem of graduality of inference and the
forms of graduality satisfied by our GMP rules.
Appendix A: Definitions of Precision Modifiers

We distinguish two types of precision modifiers: contraction modifiers and
dilation modifiers. A contraction (resp. dilation) modifier $m$ produces a nuanced term
$mP_i$ more (resp. less) precise than the basic term $P_i$. In other words, the "rule"
associated with $mP_i$ is smaller (resp. bigger) than that associated with $P_i$. We
define these modifiers in a way that the contraction modifiers contract
simultaneously the core and the support of a multiset $P_i$, and the dilation modifiers
dilate them. The amplitude of the modification (contraction or dilation) for a
precision modifier $m$ is given by a new parameter denoted as $\tau_\gamma$. The higher $\tau_\gamma$,
the more important the modification is.
Definition 3. m is said to be a contraction modifier if, and only if it is
defined as follows:

1. if Pi = {Li,Ci, Rf) thenmPi = {L^,Ci, Ri) such that Li and Ri<MRi 

2. $\forall x$, $x \in_{\alpha} P_i$ with $\tau_\alpha < \tau_M$ $\Rightarrow$ $x \in_{\beta} mP_i$ such that $\beta = \max(1, \alpha - \gamma + 1)$.

Definition 4. m is said to be a $\tau_\gamma$-dilation modifier if, and only if it is defined
as follows:

1. if Pi = {Li,Ci, Ri) thenmPi = {L^,Ci, Ri) such that L^<m Li and Ri<MLii 

2. $\forall x$, $x \in_{\alpha} P_i$ with $\tau_\alpha > \tau_1$ $\Rightarrow$ $x \in_{\beta} mP_i$ such that $\beta = \min(M, \gamma + \alpha - 1)$.
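
The degree shifts in item 2 of Definitions 3 and 4 can be illustrated with a short Python sketch. It is only an illustration under assumptions that are not part of the source: membership degrees are represented by their integer indices 1..M, the value M = 7 is an arbitrary choice, and the function names are hypothetical. Degrees not covered by item 2 (alpha = M for contraction, alpha = 1 for dilation) are left unchanged here, which is also an assumption.

```python
M = 7  # assumed number of degrees tau_1 .. tau_M (illustrative value only)


def contract_degree(alpha: int, gamma: int, m: int = M) -> int:
    """Index beta such that x belongs to mP_i to degree tau_beta
    for a contraction of amplitude tau_gamma."""
    if alpha < m:
        return max(1, alpha - gamma + 1)
    return alpha  # assumption: degree tau_M is not addressed by item 2


def dilate_degree(alpha: int, gamma: int, m: int = M) -> int:
    """Index beta such that x belongs to mP_i to degree tau_beta
    for a dilation of amplitude tau_gamma."""
    if alpha > 1:
        return min(m, gamma + alpha - 1)
    return alpha  # assumption: degree tau_1 is not addressed by item 2


if __name__ == "__main__":
    for alpha in range(1, M + 1):
        print(alpha, contract_degree(alpha, 3), dilate_degree(alpha, 3))
```

With gamma = 1 both functions return alpha unchanged, which matches the reading that a higher amplitude gives a stronger modification.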




Partial Lattice-Valued Possibilistic Measures
and Some Relations Induced by Them



Ivan Kramosil 

Institute of Computer Science 
Academy of Sciences of the Czech Republic 
Pod vodarenskou vezi 2, 182 07 Prague 8, Czech Republic 
kramosil@cs.cas.cz



Abstract. For a number of reasons rooted in our surrounding real world, 
the degrees of uncertainty and, in particular, the values of possibilistic 
measures related to various phenomena, need not be definable quantitatively,
by real numbers from the unit interval, say, but rather only qualitatively
and related to each other (greater than, not smaller than, ...).
Moreover, the values of possibilistic measures need not be known or even 
defined for every event from the field of events under consideration. Three 
extensions of partial lattice-valued possibilistic and necessity measures 
to the whole system of events under consideration are introduced and 
some assertions showing their various properties and mutual relations 
are presented and proved. 



1 Introduction, Motivation, Preliminaries 

Because of the very limited extent of this contribution we have to purposely
leave aside the history, motivation and intuition leading to and standing behind
the notion of possibilistic measure and possibility theory, as well as the most
elementary definitions and results concerning the classical possibilistic measures
conceived as mappings which take the whole power-set of all subsets of a universe
of discourse into the unit interval of real numbers and meet some well-known
conditions. The reader is kindly asked to consult [3,4], or another appropriate
source, for these purposes. Our more detailed reasoning will begin with the notion
of partial (numerical) possibilistic measure with the aim to modify it, below, to
the case of non-numerical possibilistic measures, in particular to those taking their
values in a complete lattice.

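Definition 1.1. Let $\Omega$ be a nonempty set, let $\emptyset \neq \mathcal{R} \subset \mathcal{P}(\Omega)$ be a nonempty
system of subsets of $\Omega$. A mapping $\Pi$ which takes $\mathcal{R}$ into $[0,1]$ ($\Pi : \mathcal{R} \to [0,1]$,
in symbols) is called a partial possibilistic measure on $\mathcal{R}$, if

(i) $\Pi(\emptyset) = 0$ and/or $\Pi(\Omega) = 1$, if $\emptyset$ and/or $\Omega$ are in $\mathcal{R}$,

(ii) for each $\mathcal{R}_0 = \{A, B\} \subset \mathcal{R}$ such that $\bigcup \mathcal{R}_0 = \bigcup_{A \in \mathcal{R}_0} A \in \mathcal{R}$, the equality
$\Pi(\bigcup \mathcal{R}_0) = \bigvee \{\Pi(A) : A \in \mathcal{R}_0\}$ holds.

A partial possibilistic measure $\Pi$ on $\mathcal{R}$ is called finitely complete, if (ii) holds
for each finite nonempty $\mathcal{R}_0 \subset \mathcal{R}$ such that $\bigcup \mathcal{R}_0 \in \mathcal{R}$. A partial possibilistic
measure $\Pi$ on $\mathcal{R}$ is called complete, if (ii) holds for each nonempty $\mathcal{R}_0 \subset \mathcal{R}$ such
that $\bigcup \mathcal{R}_0 \in \mathcal{R}$. □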

The condition (ii) for $\mathcal{R}_0 = \{A, B\}$ simply claims that $\Pi(A \cup B) = \Pi(A) \vee \Pi(B)$
for each $A, B \in \mathcal{R}$ such that $A \cup B \in \mathcal{R}$. Contrary to the case of total
possibilistic measures (when $\mathcal{R} = \mathcal{P}(\Omega)$), a partial possibilistic measure need not
be finitely complete. Indeed, there may be $A, B, C \in \mathcal{R}$ such that $A \cup B \cup C \in \mathcal{R}$,
but neither $A \cup B$, nor $A \cup C$, nor $B \cup C$ are in $\mathcal{R}$, so that $\Pi(A \cup B \cup C) =
\Pi(A) \vee \Pi(B) \vee \Pi(C)$ need not hold for a partial possibilistic measure defined on
$\mathcal{R}$. Each partial possibilistic measure on $\mathcal{R}$ is a special case of the so called partial
fuzzy measure; it is a mapping $\Pi : \mathcal{R} \to [0,1]$ such that (i) of Definition 1.1 holds
and the inequality $\Pi(A) \le \Pi(B)$ is valid for each $A, B \in \mathcal{R}$ such that $A \subset B$.
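
The gap between condition (ii) for pairs and finite completeness can be checked on a small finite instance. The following Python sketch is an illustration only: the three-element universe, the chosen values and the helper names are assumptions introduced here, not taken from the source. The domain contains the three singletons and the whole universe, so no union of two distinct singletons lies in the domain, while the union of all three does.

```python
from itertools import combinations


def union(sets):
    """Union of a family of frozensets."""
    out = frozenset()
    for s in sets:
        out |= s
    return out


def satisfies_ii(pi, size):
    """Check Pi(union R0) == max of Pi over R0 for every subfamily R0
    with `size` members whose union belongs to the domain of Pi."""
    for family in combinations(pi, size):
        u = union(family)
        if u in pi and pi[u] != max(pi[a] for a in family):
            return False
    return True


omega = frozenset({1, 2, 3})
# Hypothetical partial measure: singletons get 0, the universe gets 1.
pi = {
    frozenset({1}): 0.0,
    frozenset({2}): 0.0,
    frozenset({3}): 0.0,
    omega: 1.0,
}

pairwise_ok = satisfies_ii(pi, 2)   # condition (ii) of Definition 1.1
triple_ok = satisfies_ii(pi, 3)     # needed for finite completeness

if __name__ == "__main__":
    print(pairwise_ok, triple_ok)
```

Here the pairwise condition holds, while the triple of singletons violates the supremum identity, so this mapping is a partial possibilistic measure that is not finitely complete.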

Partial lattice-valued possibilistic measures were investigated in detail by
De Cooman [2] under the rather strong and simplifying assumption that the
domain $\mathcal{R} \subset \mathcal{P}(\Omega)$ on which the possibilistic measure in question, say $\Pi$, is
defined, is the so called ample field, i.e., that $\mathcal{R}$ is closed w.r. to all unions,
intersections, and complements. Given $\omega \in \Omega$, the atom $[\omega]_{\mathcal{R}} = \bigcap \{A \in \mathcal{R} : \omega \in A\}$
is defined, so yielding an equivalence relation $\approx_{\mathcal{R}}$ on $\Omega$, according to which
$\omega_1 \approx_{\mathcal{R}} \omega_2$ iff $[\omega_1]_{\mathcal{R}} = [\omega_2]_{\mathcal{R}}$ holds. As a matter of fact, a partial possibilistic
measure on $\mathcal{R}$ can be replaced by a total possibilistic measure on the power-set
of all subsets of the factor-space $\Omega / \approx_{\mathcal{R}}$, so that the greatest part of the notions,
methods and results used in the case of total possibilistic measures (possibilistic
distributions, e.g.) can be applied.

Nevertheless, in what follows, we intentionally aim to keep our reasoning at
a level as general as possible, not introducing any specific assumptions as
far as the structure of the system $\mathcal{R}$ is concerned. This approach is motivated
by the idea to develop a tool enabling us to process, to the degree possible,
collections of pieces of information concerning qualitative possibility preferences
given to particular pairs of phenomena (events). A more detailed description and
discussion of this motivation can be found in [8], but the limited extent of this
contribution does not allow us to introduce it here.

Applying the idea used in the standard measure theory, we arrive at the 
notion of inner and outer measure. 

Definition 1.2. Let $\Omega$ and $\mathcal{R}$ be as in Definition 1.1, let $\Pi$ be a mapping which
takes $\mathcal{R}$ into $[0,1]$. The inner (lower) measure $\Pi_*$ and the outer (upper) measure
$\Pi^*$ induced on $\mathcal{P}(\Omega)$ by $\Pi$ are defined, for each $A \subset \Omega$, by

$\Pi_*(A) = \bigvee \{\Pi(B) : B \subset A,\ B \in \mathcal{R}\}$, (1.1)

$\Pi^*(A) = \bigwedge \{\Pi(B) : B \supset A,\ B \in \mathcal{R}\}$, (1.2)

here $\bigvee$ and $\bigwedge$ denote the standard supremum and infimum in $[0,1]$; for the empty
subset of $[0,1]$ the conventions $\bigvee \emptyset = 0$ and $\bigwedge \emptyset = 1$ apply (or we enrich $\mathcal{R}$ by $\emptyset$
and/or $\Omega$).
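
For a finite universe, formulas (1.1) and (1.2) translate directly into code. The Python sketch below is illustrative only: the universe, the domain and the numerical values are hypothetical choices, and the function names `inner` and `outer` are introduced here for convenience. The conventions for the empty supremum and infimum are implemented through the `default` argument of `max` and `min`.

```python
def inner(pi, a):
    """Inner measure (1.1): supremum of pi over domain sets contained in a.
    The supremum of the empty set is taken to be 0."""
    return max((v for b, v in pi.items() if b <= a), default=0.0)


def outer(pi, a):
    """Outer measure (1.2): infimum of pi over domain sets containing a.
    The infimum of the empty set is taken to be 1."""
    return min((v for b, v in pi.items() if b >= a), default=1.0)


omega = frozenset({1, 2, 3})
# Hypothetical monotone mapping on a four-element domain.
pi = {
    frozenset(): 0.0,
    frozenset({1}): 0.3,
    frozenset({1, 2}): 0.7,
    omega: 1.0,
}

a = frozenset({1, 3})              # a set outside the domain
lower, upper = inner(pi, a), outer(pi, a)

if __name__ == "__main__":
    print(lower, upper)
```

For the set `a` the two extensions differ, while on every set of the domain they both return the original value, because the mapping chosen here is monotone.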
As can be easily seen, if $\Pi$ is a partial fuzzy measure on $\mathcal{R}$, $\Pi_*$ and $\Pi^*$
are fuzzy measures on $\mathcal{P}(\Omega)$, both extending conservatively $\Pi$ from $\mathcal{R}$ to $\mathcal{P}(\Omega)$,
so that $\Pi_*(A) = \Pi(A) = \Pi^*(A)$ holds for each $A \in \mathcal{R}$ (only the inequality
$\Pi_*(A) \le \Pi^*(A)$ is valid in general for $A \subset \Omega$). On the other side, if $\Pi$ is a partial
possibilistic measure on $\mathcal{R}$, neither $\Pi_*$ nor $\Pi^*$ need be possibilistic measures on
$\mathcal{P}(\Omega)$ unless some rather strong supplementary conditions are satisfied (cf. [7]
for some results concerning inner and outer possibilistic measures).

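An alternative way to extend $\Pi$ from its domain $\mathcal{R} \subset \mathcal{P}(\Omega)$ to the whole
power-set $\mathcal{P}(\Omega)$ reads as follows. Set, for every $\omega \in \Omega$,

$\pi_0(\omega) = \bigwedge \{\Pi(R) : \omega \in R \in \mathcal{R}\}$. (1.3)

Hence, if the singleton $\{\omega\}$ is in $\mathcal{R}$, then obviously $\pi_0(\omega) = \Pi(\{\omega\})$. Supposing
that $\pi_0$ is a possibilistic distribution on $\Omega$, i.e., that $\bigvee_{\omega \in \Omega} \pi_0(\omega) = 1$, set for
each $A \subset \Omega$

$\Pi_0(A) = \bigvee_{\omega \in A} \pi_0(\omega)$, (1.4)

so obtaining a complete and total possibilistic measure $\Pi_0$ on $\mathcal{P}(\Omega)$. The
condition $\bigvee_{\omega \in \Omega} \pi_0(\omega) = 1$ need not hold in general. Indeed, take an infinite set $\Omega$,
take $\mathcal{R}$ containing just the empty set, all singletons $\{\omega\}$, $\omega \in \Omega$, and all infinite
subsets of $\Omega$, and set $0 = \Pi(\{\omega\}) = \Pi(\emptyset)$ for each $\omega \in \Omega$, $\Pi(A) = 1$ for each
infinite subset $A$ of $\Omega$. Consequently, $\pi_0(\omega) = \Pi(\{\omega\}) = 0$ for each $\omega \in \Omega$, so that
$\pi_0$ does not define a possibilistic distribution on $\Omega$.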

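Finally, given a partial possibilistic measure $\Pi$ defined on $\emptyset \neq \mathcal{R} \subset \mathcal{P}(\Omega)$, the
induced partial necessity measure $N_\Pi$ can be defined on the system
$\mathcal{R}^{\sim} = \{A \subset \Omega : \Omega - A \in \mathcal{R}\}$ of all complements of sets from $\mathcal{R}$, setting
$N_\Pi(A) = 1 - \Pi(\Omega - A)$ for each $A \in \mathcal{R}^{\sim}$.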

The aim of this contribution will be to introduce non-numerical (in particular,
lattice-valued) partial fuzzy and possibilistic measures and to modify the
definitions of the induced mappings $\Pi_*$, $\Pi^*$, $\Pi_0$ and $N_\Pi$ to this case.

2 Partial Lattice-Valued Possibilistic Measures

For partially ordered sets, lattices, Boolean algebras, and related structures the
reader is kindly asked to consult [1,5,9], or some more recent source. The starting
notion of our further reasoning will be that of partially ordered set.

Definition 2.1. Let $T$ be a nonempty set. A binary relation $\le$ on $T$ (i.e., a
subset of the Cartesian product $T \times T$) is called a partial ordering, if it is reflexive,
antisymmetric and transitive, i.e., if for each $x \in T$, $x \le x$ holds, if for each
$x, y \in T$ such that $x \le y$ and $y \le x$ hold simultaneously, the identity $x = y$
follows, and if, for each $x, y, z \in T$, if $x \le y$ and $y \le z$ hold simultaneously,
then $x \le z$ also holds. If $\le$ is a partial ordering on $T$, then the pair $\mathcal{T} = (T, \le)$
is called a partially ordered set (p.o.set).

Given a p.o.set $\mathcal{T} = (T, \le)$ and a nonempty subset $S \subset T$, its supremum $\bigvee_{t \in S} t$
($\bigvee S$, for short) and infimum $\bigwedge_{t \in S} t$ ($\bigwedge S$, for short) are defined in
the standard way; of course, $\bigvee S$ and/or $\bigwedge S$ may be undefined for some $S \subset T$.
A p.o.set $\mathcal{T} = (T, \le)$ is called a complete lattice, if for each $\emptyset \neq S \subset T$
the supremum $\bigvee S$ and the infimum $\bigwedge S$ are defined. Hence, also the minimum
(element) $0_{\mathcal{T}} = \bigwedge T$ and the maximum (element) $1_{\mathcal{T}} = \bigvee T$ are defined and we
can set, for the empty subset of $T$, $\bigvee \emptyset = 0_{\mathcal{T}}$ and $\bigwedge \emptyset = 1_{\mathcal{T}}$.

Let us recall that the notion of complete lattice is still general enough to cover 
qualitatively different structures like, e. g., the unit interval of real numbers with 
their standard linear ordering, and a complete Boolean algebra, in particular, 
the power-set of all subsets of a given space, partially ordered by the relation of 
set-theoretic inclusion. The main difference between complete lattices in general
and both the examples just mentioned consists in the fact that a complete lattice
lacks an operation of negation or complement, represented by the operation $1 - x$
in the first example, and by the set-theoretic complement $X - A$, $X$ being the
space in question, in the second example. A partial remedy may read as follows.

Definition 2.2. Let $\mathcal{T} = (T, \le)$ be a complete lattice. For each $t \in T$, its
(pseudo-)complement $t^c$ is defined by

$t^c = \bigvee \{s \in T : s \wedge t = 0_{\mathcal{T}}\}$. (2.1)

□

A dual approach according to which $t^c = \bigwedge \{s \in T : s \vee t = 1_{\mathcal{T}}\}$ would be
also possible, but we limit ourselves, in what follows, to (2.1). If $\mathcal{B} = (B, \vee, \wedge, \neg)$
is a Boolean algebra and $\le$ is the standard partial ordering induced in $B$, i.e.,
$x \le y$ iff $x \wedge y = x$ (iff $x \vee y = y$), then the pseudo-complement in $(B, \le)$ agrees
with the complement in $B$, hence, $x^c = \neg x$ whenever $x^c$ is defined. In particular,
for $\mathcal{T} = (\mathcal{P}(X), \subset)$, $A^c = X - A$ for each $A \subset X$.
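
Definition (2.1) can be evaluated mechanically on finite lattices. The Python sketch below is an illustration under stated assumptions: a five-point chain stands in for the unit interval, a three-element set stands in for the space $X$, and the helper name `pseudo_complement` is introduced here. It computes the supremum in (2.1) by folding the join over all elements whose meet with $t$ is the bottom element.

```python
from functools import reduce
from itertools import combinations


def pseudo_complement(t, elements, meet, join, bottom):
    """Pseudo-complement (2.1): join of all s with meet(s, t) == bottom."""
    candidates = [s for s in elements if meet(s, t) == bottom]
    return reduce(join, candidates, bottom)


# Finite chain used as a stand-in for ([0, 1], <=): meet = min, join = max.
chain = [0.0, 0.25, 0.5, 0.75, 1.0]
chain_pc = {t: pseudo_complement(t, chain, min, max, 0.0) for t in chain}

# Power set of a small space X ordered by inclusion.
x = frozenset({1, 2, 3})
subsets = [frozenset(c) for r in range(len(x) + 1)
           for c in combinations(sorted(x), r)]
set_pc = {
    a: pseudo_complement(a, subsets, frozenset.intersection,
                         frozenset.union, frozenset())
    for a in subsets
}

if __name__ == "__main__":
    print(chain_pc)
    print(all(set_pc[a] == x - a for a in subsets))
```

On the power set the computed pseudo-complement coincides with the set-theoretic complement, whereas on the chain every nonzero element is mapped to the bottom element and the bottom element to the top.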

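Definition 2.3. Let $\mathcal{T} = (T, \le)$ be a complete lattice, let $\Omega$ be a nonempty set,
let $\mathcal{R}$ be a nonempty system of subsets of $\Omega$. A mapping $\Pi : \mathcal{R} \to T$ is called
a partial $\mathcal{T}$-(valued) fuzzy measure on $\mathcal{R}$, if $\Pi(\emptyset) = 0_{\mathcal{T}}$ and/or $\Pi(\Omega) = 1_{\mathcal{T}}$
supposing that $\emptyset \in \mathcal{R}$ and/or $\Omega \in \mathcal{R}$, and if the relation $\Pi(A) \le \Pi(B)$ holds for
each $A, B \in \mathcal{R}$ such that $A \subset B$. A partial $\mathcal{T}$-fuzzy measure $\Pi$ on $\mathcal{R}$ is called a
partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$, if the relation $\Pi(A \cup B) = \Pi(A) \vee \Pi(B)$
holds for each $A, B \in \mathcal{R}$ such that $A \cup B \in \mathcal{R}$. A partial $\mathcal{T}$-possibilistic measure
$\Pi$ on $\mathcal{R}$ is called finitely complete, if $\Pi(\bigcup \mathcal{R}_0) = \bigvee_{R \in \mathcal{R}_0} \Pi(R)$ holds for each
finite subsystem $\mathcal{R}_0 \subset \mathcal{R}$ such that $\bigcup \mathcal{R}_0 \in \mathcal{R}$, and it is called complete, if the
same relation holds for each $\emptyset \neq \mathcal{R}_0 \subset \mathcal{R}$ such that $\bigcup \mathcal{R}_0 \in \mathcal{R}$; here $\bigcup \mathcal{R}_0$ stands
for $\bigcup_{R \in \mathcal{R}_0} R$. □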

Also the definition of inner and outer measures copies the way applied in the
case of numerical mappings.

Definition 2.4. Let $\mathcal{T}$, $\Omega$, and $\mathcal{R}$ be as in Definition 2.3, let $\Pi$ be any mapping
which takes $\mathcal{R}$ into $T$. The inner (or lower) measure $\Pi_*$ and the outer (or upper)
measure $\Pi^*$ induced by $\Pi$ on $\mathcal{P}(\Omega)$ are defined, for any $A \subset \Omega$, by

$\Pi_*(A) = \bigvee \{\Pi(B) : B \subset A,\ B \in \mathcal{R}\}$, (2.2)

$\Pi^*(A) = \bigwedge \{\Pi(B) : B \supset A,\ B \in \mathcal{R}\}$, (2.3)

applying the conventions $\bigvee \emptyset = 0_{\mathcal{T}}$ and $\bigwedge \emptyset = 1_{\mathcal{T}}$, if necessary, or joining $\emptyset$ and
$\Omega$ with $\mathcal{R}$ and ascribing them the requested values $\Pi(\emptyset) = 0_{\mathcal{T}}$ and $\Pi(\Omega) = 1_{\mathcal{T}}$.

□

As in the case of numerical partial possibilistic measures, Definition 2.4
implies almost obviously that if $\Pi$ is a partial $\mathcal{T}$-fuzzy measure on $\mathcal{R}$, both $\Pi_*$
and $\Pi^*$ are $\mathcal{T}$-fuzzy measures on $\mathcal{P}(\Omega)$ extending conservatively $\Pi$, so that
$\Pi_*(A) \le \Pi^*(A)$ and $\Pi_*(B) = \Pi(B) = \Pi^*(B)$ holds for each $A \subset \Omega$ and each
$B \in \mathcal{R}$.

Dual mappings to those introduced in Definition 2.3 can be defined in two 
ways which are equivalent provided that some rather general conditions are 
fulfilled. 

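Definition 2.5. Let $\mathcal{T}$, $\Omega$, and $\mathcal{R}$ be as in Definition 2.3. A mapping $\Pi : \mathcal{R} \to T$ is
called a dual partial $\mathcal{T}$-(valued) possibilistic measure on $\mathcal{R}$, if $\Pi(\emptyset) = 0_{\mathcal{T}}$ and/or
$\Pi(\Omega) = 1_{\mathcal{T}}$ supposing that $\emptyset \in \mathcal{R}$ and/or $\Omega \in \mathcal{R}$ and if $\Pi(A \cap B) = \Pi(A) \wedge \Pi(B)$
for every $A, B \in \mathcal{R}$ such that $A \cap B \in \mathcal{R}$. A dual partial $\mathcal{T}$-possibilistic measure $\Pi$ on
$\mathcal{R}$ is called (finitely) complete, if $\Pi(\bigcap \mathcal{R}_0) = \bigwedge_{R \in \mathcal{R}_0} \Pi(R)$ holds for each (finite)
$\emptyset \neq \mathcal{R}_0 \subset \mathcal{R}$ such that $\bigcap \mathcal{R}_0 = \bigcap_{R \in \mathcal{R}_0} R \in \mathcal{R}$. □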

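Definition 2.6. Let $\mathcal{T}$, $\Omega$, and $\mathcal{R}$ be as in Definition 2.3, let $\Pi : \mathcal{R} \to T$ be
a partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$, let $\mathcal{R}^{\sim} = \{A \subset \Omega : \Omega - A \in \mathcal{R}\}$ be
the system of all complements of subsets from $\mathcal{R}$. The mapping $N_\Pi : \mathcal{R}^{\sim} \to T$
defined by

$N_\Pi(A) = (\Pi(\Omega - A))^c = \bigvee \{s \in T : s \wedge \Pi(\Omega - A) = 0_{\mathcal{T}}\}$ (2.4)

for every $A \in \mathcal{R}^{\sim}$ is called the partial $\mathcal{T}$-(valued) necessity measure induced by
$\Pi$ on $\mathcal{R}^{\sim}$. □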

Each dual partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$ is obviously also a partial
$\mathcal{T}$-fuzzy measure on $\mathcal{R}$. Also the fact that if $\Pi$ is a partial $\mathcal{T}$-possibilistic measure
on $\mathcal{R}$, the mapping $N_\Pi$ is a dual partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}^{\sim}$, is almost
evident (cf. Lemma 5.1 in [8]). The converse implication to the last one does not
hold in general; the following assertion states a sufficient condition for it (cf.
Theorem 5.1 in [8] for the proof).

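Theorem 2.1. Let $\mathcal{T}$, $\Omega$, and $\mathcal{R}$ be as in Definition 2.3, let $(t^c)^c = t$ hold for each
$t \in T$. Then there exists, for each dual partial $\mathcal{T}$-possibilistic measure $\Sigma : \mathcal{R} \to T$,
a partial $\mathcal{T}$-possibilistic measure $\Pi_\Sigma$ on $\mathcal{R}^{\sim}$ such that $\Sigma$ is identical with $N_{\Pi_\Sigma}$
on $\mathcal{R}$ (as a matter of fact, the only thing we have to do is to set $\Pi_\Sigma(A) = (\Sigma(\Omega - A))^c$
for each $A \in \mathcal{R}^{\sim}$). □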





Let us recall that the relation $(t^c)^c = t$ does not hold in general. If $\mathcal{T} =
(\mathcal{P}(X), \subset)$ for some $X$ then this is the case, as $X - (X - A) = A$
trivially holds for each $A \subset X$. However, if $\mathcal{T} = ([0,1], \le)$, then for each $t \in (0,1)$
we obtain that $t^c = \bigvee \{s \in [0,1] : s \wedge t = 0\} = 0$, so that
$(t^c)^c = \bigvee \{s \in [0,1] : s \wedge 0 = 0\} = 1 > t$ holds.

We have chosen complete lattices as the structures in which possibilistic
measures take their values, as complete lattices are perhaps the most specific
structures still general enough to cover the two most often used particular cases:
set-valued (or, slightly generalizing, Boolean-valued) possibilistic measures with
values partially ordered by set inclusion, and real-valued possibilistic measures
taking their values in the unit interval equipped with its standard linear ordering.
For these reasons we prefer also to define the notion of (pseudo-)complement in
a way general enough to need just the structure of complete lattice, even if the
resulting operation is compatible neither with the set-theoretic complement
$X - \cdot$ if $\mathcal{T} = (\mathcal{P}(X), \subset)$, nor with the standard subtraction $1 - \cdot$ in $([0,1], \le)$.
An alternative approach would be to enrich the complete lattice $\mathcal{T} = (T, \le)$
by a new (and ontologically independent) operation of negation or complement
meeting some reasonable conditions axiomatically imposed on it. However, let
us postpone a more detailed development of this idea till another occasion.



3 Inner, Outer, and Necessity Measures Induced by a
Partial $\mathcal{T}$-Possibilistic Measure

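As a matter of fact, if $\Pi$ is a $\mathcal{T}$-possibilistic measure on $\mathcal{R} \subset \mathcal{P}(\Omega)$, then neither
$\Pi_*$ nor $\Pi^*$ need be a $\mathcal{T}$-possibilistic measure on $\mathcal{P}(\Omega)$. Indeed, take $\mathcal{R}_1 = \{\emptyset, \Omega\}$
and set $\Pi_1(\emptyset) = 0_{\mathcal{T}}$, $\Pi_1(\Omega) = 1_{\mathcal{T}}$, so that $\Pi_1$ is the most trivial partial
$\mathcal{T}$-possibilistic measure. Given a nonempty proper subset $A$ of $\Omega$, we obtain that
$\Pi_{1*}(A) = \Pi_{1*}(\Omega - A) = 0_{\mathcal{T}}$, so that $\Pi_{1*}(A \cup (\Omega - A)) = \Pi_{1*}(\Omega) = 1_{\mathcal{T}} >
0_{\mathcal{T}} = \Pi_{1*}(A) \vee \Pi_{1*}(\Omega - A)$ follows. Now, take $\mathcal{R}_2 = \{\emptyset, A, B, \Omega\}$, where $A$ and
$B$ are mutually disjoint nonempty subsets of $\Omega$ such that $A \cup B \neq \Omega$, hence,
$A \cup B \notin \mathcal{R}_2$, and set $\Pi_2(\emptyset) = \Pi_2(A) = \Pi_2(B) = 0_{\mathcal{T}}$, $\Pi_2(\Omega) = 1_{\mathcal{T}}$. So, $\Pi_2$ is a
partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}_2$ and we obtain that $\Pi_2^*(\Omega) = \Pi_2^*(A \cup B) =
1_{\mathcal{T}} > \Pi_2^*(A) \vee \Pi_2^*(B) = 0_{\mathcal{T}} \vee 0_{\mathcal{T}} = 0_{\mathcal{T}}$. Consequently, neither $\Pi_{1*}$ nor $\Pi_2^*$ are
$\mathcal{T}$-possibilistic measures on $\mathcal{P}(\Omega)$.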
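
Both counterexamples can be replayed on a small finite instance. The Python sketch below is illustrative only: the three-element universe, the concrete sets `A` and `B`, and the use of the two-element lattice {0, 1} as the value lattice are assumptions made here, and the helper names are introduced for convenience.

```python
def inner(pi, a):
    """Inner measure: join of pi over domain sets contained in a (empty join = 0)."""
    return max((v for b, v in pi.items() if b <= a), default=0)


def outer(pi, a):
    """Outer measure: meet of pi over domain sets containing a (empty meet = 1)."""
    return min((v for b, v in pi.items() if b >= a), default=1)


omega = frozenset({1, 2, 3})
A = frozenset({1})
B = frozenset({2})          # disjoint from A, and A | B differs from omega

# First counterexample: domain {empty set, omega}, inner measure.
pi1 = {frozenset(): 0, omega: 1}
comp = omega - A
lhs1 = inner(pi1, A | comp)
rhs1 = max(inner(pi1, A), inner(pi1, comp))

# Second counterexample: domain {empty set, A, B, omega}, outer measure.
pi2 = {frozenset(): 0, A: 0, B: 0, omega: 1}
lhs2 = outer(pi2, A | B)
rhs2 = max(outer(pi2, A), outer(pi2, B))

if __name__ == "__main__":
    print(lhs1, rhs1, lhs2, rhs2)
```

In both cases the value on the union strictly exceeds the join of the values on the two parts, so the extension fails the defining identity of a possibilistic measure.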

The next statement introduces some sufficient conditions under which the
inner measure $\Pi_*$ induced by a partial $\mathcal{T}$-possibilistic measure $\Pi$ on $\mathcal{R}$ is also a
$\mathcal{T}$-possibilistic measure.

Theorem 3.1. Let $\mathcal{T}$, $\Omega$, and $\mathcal{R}$ be as in Definition 2.3, let $\Pi$ be a partial
$\mathcal{T}$-possibilistic measure on $\mathcal{R}$. Set

$\mathcal{R}_0 = \{A \subset \Omega : A \cap C \in \mathcal{R} \text{ for each } C \in \mathcal{R}\}$. (3.1)

Then $\Pi_*$ is a partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}_0$. □
Proof. Cf. Theorem 6.1 in [8].

The three following remarks are worth being stated explicitly.

(i) If $\mathcal{R}$ is closed with respect to intersections, i.e., if $A \cap B \in \mathcal{R}$ holds for
each $A, B \in \mathcal{R}$, then the inclusion $\mathcal{R} \subset \mathcal{R}_0$ easily follows, so that $\Pi_*$ extends $\Pi$
conservatively from $\mathcal{R}$ to $\mathcal{R}_0$.

(ii) If TZ is closed w.r.to subsets, i. e., if C € TZ and D C C implies that 
D G TZ, and if IT G TZ, then TZq = P(f2)(such systems are called hereditary in 
[6]). 

(iii) If $\Pi$ is a (finitely) complete partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$ and if the
other conditions of Theorem 3.1 are satisfied, then $\Pi_*$ is a (finitely) complete
partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}_0$.

A partially ordered set $\mathcal{T} = (T, \le)$ is called continuous from above
($\downarrow$-continuous, for short), if for each $x \in T$ the infimum of all elements greater
than $x$ is defined and identical with $x$, in symbols, if $x = \bigwedge \{t \in T : t > x\}$ holds
for each $x \in T$.

Theorem 3.2. Let $\Omega$ be a nonempty set, let $\mathcal{R}$ be a nonempty system of subsets
of $\Omega$ closed w.r.to unions, i.e., $A \cup B \in \mathcal{R}$ for each $A, B \in \mathcal{R}$, let $\mathcal{T}$ be a
$\downarrow$-continuous complete lower semilattice and (not necessarily complete) upper
semilattice, let $\Pi$ be a partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$. Then $\Pi^*$ is a
$\mathcal{T}$-possibilistic measure on $\mathcal{P}(\Omega)$.

Proof. Cf. Theorem 6.2 in [8].

Given a partial $\mathcal{T}$-fuzzy measure $\Pi$ on a system $\mathcal{R}$ of subsets of a nonempty
set $\Omega$, and supposing, in order to simplify further considerations and reasoning,
that $\{\emptyset, \Omega\} \subset \mathcal{R}$, we may either extend it to $\mathcal{P}(\Omega)$, defining $\Pi_*$ and $\Pi^*$ as above,
or we may define the necessity measure $N_\Pi$ on $\mathcal{R}^{\sim} = \{A \subset \Omega : \Omega - A \in \mathcal{R}\}$,
setting $N_\Pi(A) = (\Pi(\Omega - A))^c$ for every $A \in \mathcal{R}^{\sim}$. These operations can be
applied also sequentially, step by step, and in different order. So, we arrive at
$N_{(\Pi_*)}$, if $\Pi_*$ is defined first, or at $(N_\Pi)_*$, if $N_\Pi$ on $\mathcal{R}^{\sim}$ is defined and then
extended to $\mathcal{P}(\Omega)$. Also the dual cases $N_{(\Pi^*)}$ and $(N_\Pi)^*$ will be investigated.

Theorem 3.3. Under the notations and conditions introduced, the inequality

$(N_\Pi)_*(A) \le N_{(\Pi_*)}(A)$ (3.2)

holds for each $A \subset \Omega$.

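Proof. For each $A \subset \Omega$ we obtain that

$(N_\Pi)_*(A) = \bigvee \{N_\Pi(B) : B \subset A,\ B \in \mathcal{R}^{\sim}\} =$ (3.3)

$= \bigvee \{(\Pi(\Omega - B))^c : \Omega - B \supset \Omega - A,\ \Omega - B \in \mathcal{R}\} =$

$= \bigvee \{\bigvee \{s \in T : s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}\} : \Omega - B \supset \Omega - A,\ \Omega - B \in \mathcal{R}\}$.

If $\Omega - B \in \mathcal{R}$ and $\Omega - B \supset \Omega - A$ holds, then the inequality

$\Pi(\Omega - B) \ge \Pi^*(\Omega - A) \ge \Pi_*(\Omega - A)$ (3.4)

follows. Consequently, for each such $B \subset \Omega$ and each $s \in T$, if $s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}$,
then also $s \wedge \Pi_*(\Omega - A) = 0_{\mathcal{T}}$ holds. Hence, the inclusion

$\bigcup \{\{s \in T : s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}\} : B \subset \Omega,\ \Omega - B \in \mathcal{R},\ \Omega - B \supset \Omega - A\}$ (3.5)

$\subset \{s \in T : s \wedge \Pi_*(\Omega - A) = 0_{\mathcal{T}}\}$

is valid. Taking the suprema of both the sets in (3.5) and applying (3.3), we
obtain that

$(N_\Pi)_*(A) \le \bigvee \{s \in T : s \wedge \Pi_*(\Omega - A) = 0_{\mathcal{T}}\} = (\Pi_*(\Omega - A))^c =$ (3.6)

$= N_{(\Pi_*)}(A)$.

The assertion is proved. □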

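The most trivial example illustrates that equality in (3.2) need not hold in
general. Take $\mathcal{R} = \{\emptyset, \Omega\}$, $\Pi(\emptyset) = 0_{\mathcal{T}}$, $\Pi(\Omega) = 1_{\mathcal{T}}$. Then

$N_\Pi(\emptyset) = (\Pi(\Omega - \emptyset))^c = (\Pi(\Omega))^c = 1_{\mathcal{T}}^c = 0_{\mathcal{T}}$, (3.7)

$N_\Pi(\Omega) = (\Pi(\Omega - \Omega))^c = (\Pi(\emptyset))^c = 0_{\mathcal{T}}^c = 1_{\mathcal{T}}$. (3.8)

Hence, for each $\emptyset \neq A \neq \Omega$, $A \subset \Omega$, we obtain that

$(N_\Pi)_*(A) = \bigvee \{N_\Pi(B) : B \subset A,\ B \in \mathcal{R}^{\sim}\} = N_\Pi(\emptyset) = 0_{\mathcal{T}}$, (3.9)

as $\emptyset$ is the only subset of $A$ which is in $\mathcal{R}^{\sim}$. However,

$N_{(\Pi_*)}(A) = (\Pi_*(\Omega - A))^c = (\bigvee \{\Pi(B) : B \subset \Omega - A,\ B \in \mathcal{R}\})^c =$ (3.10)
$= (\Pi(\emptyset))^c = 0_{\mathcal{T}}^c = 1_{\mathcal{T}}$,

as $\emptyset$ is the only subset of $\Omega - A$ which is in $\mathcal{R}$, so that the strict inequality
$(N_\Pi)_*(A) < N_{(\Pi_*)}(A)$ follows.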

The assertion dual to Theorem 3.3 reads as follows.

Theorem 3.4. Under the notations and conditions introduced, the inequality

$(N_\Pi)^*(A) \ge N_{(\Pi^*)}(A)$ (3.11)

holds for each $A \subset \Omega$.

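Proof. Analyzing the definitions of $(N_\Pi)^*(A)$ and $N_{(\Pi^*)}(A)$ we obtain that

$(N_\Pi)^*(A) = \bigwedge \{N_\Pi(B) : B \supset A,\ B \in \mathcal{R}^{\sim}\} =$ (3.12)

$= \bigwedge \{(\Pi(\Omega - B))^c : B \supset A,\ B \in \mathcal{R}^{\sim}\} =$

$= \bigwedge \{[\bigvee \{s \in T : s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}\}] : B \supset A,\ B \in \mathcal{R}^{\sim}\}$,

$N_{(\Pi^*)}(A) = (\Pi^*(\Omega - A))^c = \bigvee \{s \in T : s \wedge \Pi^*(\Omega - A) = 0_{\mathcal{T}}\}$. (3.13)

Take $A \subset \Omega$, take $B \in \mathcal{R}^{\sim}$ such that $B \supset A$ holds, then $\Omega - B \in \mathcal{R}$,
$\Omega - B \subset \Omega - A$ and, consequently, also

$\Pi(\Omega - B) \le \Pi_*(\Omega - A) = \bigvee \{\Pi(C) : C \subset \Omega - A,\ C \in \mathcal{R}\} \le \Pi^*(\Omega - A)$ (3.14)

follows. Hence, for each $s \in T$ such that $s \wedge \Pi^*(\Omega - A) = 0_{\mathcal{T}}$, also $s \wedge \Pi(\Omega - B) =
0_{\mathcal{T}}$ holds, so that the set inclusion

$\{s \in T : s \wedge \Pi^*(\Omega - A) = 0_{\mathcal{T}}\} \subset \{s \in T : s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}\}$ (3.15)

is valid for each $B \in \mathcal{R}^{\sim}$, $B \supset A$. Consequently, for each such $B$ we obtain that

$\bigvee \{s \in T : s \wedge \Pi^*(\Omega - A) = 0_{\mathcal{T}}\} \le \bigvee \{s \in T : s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}\}$, (3.16)

so that the inequality

$\bigvee \{s \in T : s \wedge \Pi^*(\Omega - A) = 0_{\mathcal{T}}\} \le$ (3.17)

$\le \bigwedge \{[\bigvee \{s \in T : s \wedge \Pi(\Omega - B) = 0_{\mathcal{T}}\}] : B \in \mathcal{R}^{\sim},\ B \supset A\}$

follows. Due to (3.12) and (3.13) this inequality immediately implies (3.11), so
that the assertion is proved. □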

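The same most trivial example as above with $\mathcal{R} = \{\emptyset, \Omega\}$, $\Pi(\emptyset) = 0_{\mathcal{T}}$,
and $\Pi(\Omega) = 1_{\mathcal{T}}$ demonstrates that also in (3.11) the equality need not hold in
general. Again, taking $A \subset \Omega$, $\emptyset \neq A \neq \Omega$, we obtain that

$(N_\Pi)^*(A) = \bigwedge \{N_\Pi(B) : B \supset A,\ B \in \mathcal{R}^{\sim}\} = N_\Pi(\Omega) = 1_{\mathcal{T}}$, (3.18)

but

$N_{(\Pi^*)}(A) = (\Pi^*(\Omega - A))^c = (\bigwedge \{\Pi(C) : C \supset \Omega - A,\ C \in \mathcal{R}\})^c =$ (3.19)
$= (\Pi(\Omega))^c = 1_{\mathcal{T}}^c = 0_{\mathcal{T}}$,

so that the strict inequality $(N_\Pi)^*(A) > N_{(\Pi^*)}(A)$ follows.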
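
The two strict inequalities obtained from the trivial example can be verified numerically. The Python sketch below is an illustration under assumptions made here: the universe has two elements, the value lattice is the two-element lattice {0, 1}, in which the pseudo-complement simply swaps 0 and 1, and all function and variable names are hypothetical.

```python
def inner(m, a):
    """Inner extension: join over domain sets contained in a (empty join = 0)."""
    return max((v for b, v in m.items() if b <= a), default=0)


def outer(m, a):
    """Outer extension: meet over domain sets containing a (empty meet = 1)."""
    return min((v for b, v in m.items() if b >= a), default=1)


def pc(t):
    """Pseudo-complement in the two-element lattice {0, 1}."""
    return 1 if t == 0 else 0


omega = frozenset({1, 2})
pi = {frozenset(): 0, omega: 1}           # the trivial partial measure

# Induced necessity measure on the system of complements of domain sets:
# for A = omega - B with B in the domain, N(A) = (pi(omega - A))^c = pc(pi(B)).
nec = {omega - b: pc(v) for b, v in pi.items()}

a = frozenset({1})                        # nonempty proper subset of omega

nec_inner = inner(nec, a)                 # inner extension of the necessity
n_of_inner = pc(inner(pi, omega - a))     # necessity induced by the inner measure
nec_outer = outer(nec, a)                 # outer extension of the necessity
n_of_outer = pc(outer(pi, omega - a))     # necessity induced by the outer measure

if __name__ == "__main__":
    print(nec_inner, n_of_inner, nec_outer, n_of_outer)
```

The four values reproduce (3.9), (3.10), (3.18) and (3.19): extending first and complementing afterwards gives a different result than complementing first and extending afterwards.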

4 Possibilistic Completion of Partial $\mathcal{T}$-Possibilistic
Measures

The next statement introduces the lattice-valued modifications of the mappings
$\pi_0$ and $\Pi_0$, defined by (1.3) and (1.4) for the numerical case, and states
under which conditions the modified mappings define a $\mathcal{T}$-possibilistic distribution on $\Omega$
and a $\mathcal{T}$-possibilistic measure on $\mathcal{P}(\Omega)$.
Theorem 4.1. Let $\mathcal{T}$ be a complete lattice, let $\Omega$ be a nonempty set, let $\mathcal{R}$
be a system of subsets of $\Omega$ containing $\emptyset$ and $\Omega$, let $\Pi$ be a complete partial
$\mathcal{T}$-possibilistic measure on $\mathcal{R}$. Set, for each $\omega \in \Omega$,

$\pi_0(\omega) = \bigwedge \{\Pi(R) : \omega \in R \in \mathcal{R}\}$. (4.1)

Then $\pi_0 : \Omega \to T$ is a $\mathcal{T}$-possibilistic distribution on $\Omega$, i.e., $\bigvee_{\omega \in \Omega} \pi_0(\omega) = 1_{\mathcal{T}}$. □

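Proof. Setting $\mathcal{R}_\omega = \{R \in \mathcal{R} : \omega \in R\}$ for each $\omega \in \Omega$, we have to prove that

$\bigvee_{\omega \in \Omega} (\bigwedge \{\Pi(R) : R \in \mathcal{R}_\omega\}) = 1_{\mathcal{T}}$. (4.2)

First, suppose that $\bigvee \{t \in T : t < 1_{\mathcal{T}}\} = t_0 < 1_{\mathcal{T}}$, and suppose that (4.2) does
not hold. Hence, $\bigwedge \{\Pi(R) : R \in \mathcal{R}_\omega\} \le t_0$ for each $\omega \in \Omega$ follows. If $\Pi(R) > t_0$
holds for each $R \in \mathcal{R}_\omega$, then $\Pi(R) = 1_{\mathcal{T}}$ for each such $R$ holds and
$\bigwedge \{\Pi(R) : R \in \mathcal{R}_\omega\} = 1_{\mathcal{T}} > t_0$ follows. So, for each $\omega \in \Omega$ there exists $R_\omega \in \mathcal{R}_\omega$ such that
$\Pi(R_\omega) \le t_0$ holds. However, $\omega \in R_\omega$, so that $\bigcup_{\omega \in \Omega} R_\omega = \Omega$ and $\Pi(\bigcup_{\omega \in \Omega} R_\omega) =
\bigvee_{\omega \in \Omega} \Pi(R_\omega) = 1_{\mathcal{T}}$ follows due to the supposed completeness of $\Pi$ on $\mathcal{R}$. On the
other side, $\Pi(R_\omega) \le t_0$ for each $\omega \in \Omega$ yields that $\bigvee_{\omega \in \Omega} \Pi(R_\omega) \le t_0 < 1_{\mathcal{T}}$ should
be valid - a contradiction.

Hence, suppose that $\bigvee \{t \in T : t < 1_{\mathcal{T}}\} = 1_{\mathcal{T}}$ and suppose that, formally
written,

$(\exists t \in T,\ t < 1_{\mathcal{T}})\ (\forall \omega \in \Omega)\ (\exists R_\omega \in \mathcal{R}_\omega)\ (\Pi(R_\omega) \le t)$. (4.3)

Using the same way of reasoning as above, we arrive, again, at the contradiction
that $\bigcup_{\omega \in \Omega} R_\omega = \Omega$, but $\bigvee_{\omega \in \Omega} \Pi(R_\omega) \le t < 1_{\mathcal{T}}$ should hold. What remains reads
that $\bigvee \{t \in T : t < 1_{\mathcal{T}}\} = 1_{\mathcal{T}}$ and the negation of (4.3), i.e.,

$(\forall t \in T,\ t < 1_{\mathcal{T}})\ (\exists \omega \in \Omega)\ (\forall R \in \mathcal{R}_\omega)\ (\Pi(R) > t)$ (4.4)

hold together. However, (4.4) implies that

$(\forall t \in T,\ t < 1_{\mathcal{T}})\ (\exists \omega \in \Omega)\ (\pi_0(\omega) \ge t)$, (4.5)

so that

$\bigvee_{\omega \in \Omega} \pi_0(\omega) = \bigvee \{t \in T : t < 1_{\mathcal{T}}\} = 1_{\mathcal{T}}$ (4.6)

follows. The assertion is proved. □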

Hence, supposing that $\pi_0$ is a $\mathcal{T}$-possibilistic distribution on $\Omega$ and setting
$\Pi_0(A) = \bigvee \{\pi_0(\omega) : \omega \in A\}$ for every $A \subset \Omega$, we obtain that $\Pi_0$ is a complete
$\mathcal{T}$-possibilistic measure on $\mathcal{P}(\Omega)$. The inequality $\Pi_0(A) \le \Pi(A)$ for every $A \in \mathcal{R}$
easily follows, as $A \in \mathcal{R}_\omega$ holds for each $\omega \in A$, so that $\pi_0(\omega) \le \Pi(A)$ is valid
for every $\omega \in A$, hence, also for $\Pi_0(A)$. This inequality cannot be, in general,
replaced by equality, as the following simple example demonstrates.

Let $\Omega = \{1,2,3,4,5\}$, let $A_1 = \{1,2\}$, $A_2 = \{3,4\}$, $A_3 = \{2,3\}$, let $\mathcal{R} =
\{\emptyset, A_1, A_2, A_3, \Omega\}$, let $\Pi : \mathcal{R} \to T$ be such that $\Pi(\emptyset) = \Pi(A_1) = \Pi(A_2) = 0_{\mathcal{T}}$,
$\Pi(A_3) = \Pi(\Omega) = 1_{\mathcal{T}}$. $\Pi$ is a partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$ and $\pi_0$ is a
$\mathcal{T}$-possibilistic distribution on $\Omega$. Indeed, for each $i = 1,2,3,4$,

$\pi_0(i) = \bigwedge \{\Pi(R) : i \in R \in \mathcal{R}\} = 0_{\mathcal{T}}$, (4.7)

but $\pi_0(5) = \Pi(\Omega) = 1_{\mathcal{T}}$. Consequently,

$\Pi_0(A_3) = \Pi_0(\{2,3\}) = \pi_0(2) \vee \pi_0(3) = 0_{\mathcal{T}} < 1_{\mathcal{T}} = \Pi(A_3)$. (4.8)
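
The five-element example can be recomputed directly from (4.1) and the definition of $\Pi_0$. The Python sketch below is only an illustration: it uses the two-element lattice {0, 1} as the value lattice (an assumption, since the text leaves the lattice general), and the helper names `pi0_point` and `pi0_measure` are introduced here.

```python
def pi0_point(pi, w):
    """Distribution value (4.1): meet of pi over all domain sets containing w."""
    return min((v for r, v in pi.items() if w in r), default=1)


def pi0_measure(pi, a):
    """Completion: join of the distribution values over the points of a."""
    return max((pi0_point(pi, w) for w in a), default=0)


omega = frozenset({1, 2, 3, 4, 5})
a1, a2, a3 = frozenset({1, 2}), frozenset({3, 4}), frozenset({2, 3})
pi = {frozenset(): 0, a1: 0, a2: 0, a3: 1, omega: 1}

dist = {w: pi0_point(pi, w) for w in sorted(omega)}
completed_a3 = pi0_measure(pi, a3)

if __name__ == "__main__":
    print(dist)
    print(completed_a3, pi[a3])
```

The point 5 is the only one with distribution value 1, so the join over all points equals 1 and the distribution is normalized, while the completed measure of the set {2, 3} stays strictly below the original value.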

For singletons the identity $\Pi_0(\{\omega\}) = \pi_0(\omega) = \Pi(\{\omega\})$ obviously holds for
each $\omega \in \Omega$ such that $\{\omega\} \in \mathcal{R}$. If $\mathcal{R}$ is closed with respect to intersections, i.e.,
if $A \cap B \in \mathcal{R}$ holds for each $A, B \in \mathcal{R}$, then $\Pi_0(A) = \Pi(A)$ for every $A \in \mathcal{R}$
supposing that $\Pi$ is complete on $\mathcal{R}$ (cf. [8] for more detail).

As in the case of inner and outer measure, we can define the necessity
measures $N_\Pi$, $(N_\Pi)_0$, and $N_{(\Pi_0)}$. The following inequalities between their values can
be proved.

Theorem 4.2. Let $\Omega \neq \emptyset$, let $\{\emptyset, \Omega\} \subset \mathcal{R} \subset \mathcal{P}(\Omega)$, let $\mathcal{T} = (T, \le)$ be a complete
lattice, let $\Pi$ be a partial $\mathcal{T}$-possibilistic measure on $\mathcal{R}$. Then, for each $A \in \mathcal{R}^{\sim}$,
the inequality

$(N_\Pi)_0(A) \le N_\Pi(A) \le N_{(\Pi_0)}(A)$ (4.9)

holds. □

Proof. Cf. Lemma 8.2 in [8].

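No inequality relation valid in general binds the values of the mappings $\Pi_0$
and $\Pi_*$. Indeed, take $\mathcal{R} = \{\emptyset, \Omega\}$ with $\Pi(\emptyset) = 0_{\mathcal{T}}$ and $\Pi(\Omega) = 1_{\mathcal{T}}$. Then, for
each $\omega \in \Omega$, $\pi_0(\omega) = \Pi(\Omega) = 1_{\mathcal{T}}$, so that $\Pi_0(A) = 1_{\mathcal{T}}$ for every $\emptyset \neq A \subset \Omega$. On
the other side, $\Pi_*(A) = \Pi(\emptyset) = 0_{\mathcal{T}}$ for every $A \subset \Omega$, $A \neq \Omega$, so that the strict
inequality $\Pi_*(A) < \Pi_0(A)$ for every $\emptyset \neq A \neq \Omega$ follows. However, recalling the
example with $\Omega = \{1,2,3,4,5\}$ from above, we remember that $\Pi_0(A_3) = 0_{\mathcal{T}}$,
but $\Pi_*(A_3) = \Pi(A_3) = 1_{\mathcal{T}}$, so that the inequality $\Pi_0(A_3) < \Pi_*(A_3)$ holds.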

One of the reviewers of this contribution suggests still another interesting way
to extend a partial $\mathcal{T}$-possibilistic measure $\Pi$ from its domain $\mathcal{R} \subset \mathcal{P}(\Omega)$ to
the whole $\mathcal{P}(\Omega)$. Namely, denote by $D_\Pi$ the set of all $\mathcal{T}$-possibilistic distributions
$\pi : \Omega \to T$ such that $\bigvee_{\omega \in A} \pi(\omega) = \Pi(A)$ holds for each $A \in \mathcal{R}$; in other words,
the total possibilistic measures defined by distributions from $D_\Pi$ agree with $\Pi$ on
$\mathcal{R}$. Supposing that the set $D_\Pi$ is nonempty and that it contains the maximum
element in the pointwise sense, i.e., that there exists $\pi_0 \in D_\Pi$ such that the
inequality $\pi_0(\omega) \ge \pi(\omega)$ holds for each $\omega \in \Omega$ and each $\pi \in D_\Pi$, the total
possibilistic measure on $\mathcal{P}(\Omega)$ defined by $\pi_0$ is defined as the extension of $\Pi$
from $\mathcal{R}$ to $\mathcal{P}(\Omega)$. The idea is certainly worth being investigated in more detail,
but what would be necessary, first of all, is to find some nontrivial sufficient
conditions under which the set $D_\Pi$ is nonempty and dominated by a pointwise
maximum element. As such investigations seem to be far from being trivial, let
us postpone them till another occasion, as well as a more detailed analysis of
possible relations between the approaches introduced above and the last one.
Abstract. In previous papers, by resorting to the most effective concept
of conditional probability, we have been able not only to define fuzzy
subsets, but also to introduce in a very natural way the basic continu- 
ous T-norms and the relevant dual T-conorms, bound to the former by 
coherence. Moreover, we have given, as an interesting and fundamental 
by-product of our approach, a natural interpretation of possibility func- 
tions, both from a semantic and a syntactic point of view. 

In this paper we study the properties of a coherent conditional prob- 
ability looked on as a general non-additive uncertainty measure of the 
conditioning events, and we prove that this measure is a capacity if and 
only if it is a possibility. 



1 Introduction 

The starting point of our approach is a synthesis of the available information
(and possibly also of the modalities of its acquisition), expressing it by one or
more events: to this purpose, the concept of event must be given its most general
meaning, i.e. it must not be looked on just as a possible outcome (a subset of the
so-called “sample space”), but expressed by a proposition.

Moreover, events play a two-fold role, since we must consider not only those
events which are the direct object of study, but also those which represent the
relevant “state of information”: in fact conditional events and conditional
probability are the tools that allow one to manage specific (conditional) statements
and to update degrees of belief on the basis of the evidence.

On the other hand, what is usually emphasized in the literature - when
a conditional probability $P(E|H)$ is taken into account - is only the fact that
$P(\cdot|H)$ is a probability for any given $H$: this is a very restrictive (and misleading)
view of conditional probability, corresponding trivially to just a modification of
the so-called “sample space” $\Omega$.

It is instead essential - for a correct handling of the subtle and delicate
problems concerning the use of conditional probability - to regard the conditioning
event $H$ as a “variable”, i.e. the “status” of $H$ in $E|H$ is not just that of
something representing a given fact, but that of an (uncertain) event (like $E$) for
which the knowledge of its truth value is not required (this means, using a
terminology due to Koopman [15], that $H$ must be looked on - even if asserted - as
being contemplated: similar terms are, respectively, acquired versus assumed).

The concepts of conditional event and conditional probability (as dealt with
in this paper) play a central role in probabilistic reasoning. In particular,
here we study the properties of a coherent conditional probability looked on as 
a general non-additive uncertainty measure of the conditioning events, and we 
prove that this measure is a capacity if and only if it is a possibility. 



2 Conditional Events and Coherent Conditional 
Probability 

In a series of papers (see, e.g., [5] and [7], and the recent book [8]) we showed
that, if we do not assign the same “third value” $t(E|H) = u$ (undetermined) to
all conditional events, but make it suitably depend on $E|H$, it turns out that
this function $t(E|H)$ can be taken as a general conditional uncertainty measure
(for example, conditional probability and conditional possibility correspond to
particular choices of the relevant operations between conditional events, looked
on as particular random variables, as shown, respectively, in [5] and [1]).

By taking as partial operations the ordinary sum and product in a given
family $\mathcal{C}$ of conditional events, with the only requirement that the resulting
value of $t(E|H)$ be a function of $(E, H)$, we proved in [5] that this function $t(\cdot|\cdot)$
satisfies “familiar” rules.

In particular, if the set $\mathcal{C} = \mathcal{G} \times \mathcal{B}$ of conditional events $E|H$ is such that $\mathcal{G}$
is a Boolean algebra and $\mathcal{B} \subseteq \mathcal{G}$ is closed with respect to (finite) logical sums,
then, putting $t(\cdot|\cdot) = P(\cdot|\cdot)$, these rules can be expressed as follows, where
$\mathcal{B}^o = \mathcal{B} \setminus \{\emptyset\}$:

(i) $P(H|H) = 1$, for every $H \in \mathcal{B}^o$;

(ii) $P(\cdot|H)$ is a (finitely additive) probability on $\mathcal{G}$ for any given $H \in \mathcal{B}^o$;

(iii) $P((E \wedge A)|H) = P(E|H) \cdot P(A|(E \wedge H))$, for every $A \in \mathcal{G}$ and $E, H \in \mathcal{B}^o$,
$E \wedge H \neq \emptyset$.

The function $P(\cdot|\cdot)$ is called a conditional probability on $\mathcal{G} \times \mathcal{B}^o$, and these rules
coincide with the usual axioms due to de Finetti [12], Rényi [17], Krauss [16],
Dubins [13].

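And what about an assessment $P$ on an arbitrary set $\mathcal{C}$ of conditional events?
We will say that the assessment $P(\cdot|\cdot)$ on $\mathcal{C}$ is coherent if there exists $\mathcal{C}' \supseteq \mathcal{C}$,
with $\mathcal{C}' = \mathcal{G} \times \mathcal{B}^o$ ($\mathcal{G}$ a Boolean algebra, $\mathcal{B}$ an additive set), such that $P(\cdot|\cdot)$ can
be extended from $\mathcal{C}$ to $\mathcal{C}'$ as a conditional probability.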

We list the peculiarities (which entail a large flexibility in the management 
of any kind of uncertainty) of this concept of coherent conditional probability 
versus the usual one: 
- due to the direct assignment of $P(E|H)$ as a whole, the knowledge (or the
assessment) of the “joint” and “marginal” unconditional probabilities $P(E \wedge H)$
and $P(H)$ is not required;

- the conditioning event $H$ (which must be a possible one) may have zero
probability, so that the class of admissible conditional probability assessments 
and that of possible extensions are larger (and the ensuing algorithms are 
more flexible); 

- it allows a management of stochastic independence (see, e.g., [3]) (conditional 
or not) which avoids many of the usual inconsistencies related to logical 
dependence. In fact, the latter situation may arise in the usual probabilistic 
approach, for example when dealing with graphical models; 

- a suitable interpretation of the extreme values 0 and 1 of $P(E|H)$ for
situations which are different, respectively, from the trivial ones $E \wedge H = \emptyset$ and
$H \subseteq E$, leads to a “natural” treatment of the default reasoning [11];

- it is possible to represent “vague” statements as those of fuzzy theory (as 
done in [4], [6], [9]). 

The following characterization of coherence has been discussed in many previous 
papers: here we adopt the formulation essentially given in [2] (see also [8]). 

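Theorem 1. Let $\mathcal{C}$ be an arbitrary family of conditional events, and consider,
for every $n \in \mathbb{N}$, any finite subfamily

$$\mathcal{F} = \{E_1|H_1, \ldots, E_n|H_n\} \subseteq \mathcal{C};$$

denote by $\mathcal{A}_o$ the set of atoms $A_r$ generated by the (unconditional) events
$E_1, H_1, \ldots, E_n, H_n$. For a real function $P$ on $\mathcal{C}$ the following three statements
are equivalent:

(a) $P$ is a coherent conditional probability on $\mathcal{C}$;

(b) for every finite subset $\mathcal{F} \subseteq \mathcal{C}$, there exists a class of coherent (unconditional)
probabilities $\{P_\alpha^{\mathcal{F}}\}$ (to avoid a cumbersome notation, in the sequel we
put $P_\alpha^{\mathcal{F}} = P_\alpha$), each probability $P_\alpha$ being defined on a suitable subset $\mathcal{A}_\alpha \subseteq \mathcal{A}_o$,
and for any $E_i|H_i \in \mathcal{C}$ there is a unique $P_\alpha$ satisfying

$$\sum_{A_r \subseteq H_i} P_\alpha(A_r) > 0, \qquad (1)$$

and

$$P(E_i|H_i) = \frac{\sum_{A_r \subseteq E_i \wedge H_i} P_\alpha(A_r)}{\sum_{A_r \subseteq H_i} P_\alpha(A_r)}; \qquad (2)$$

moreover $\mathcal{A}_{\alpha'} \subset \mathcal{A}_{\alpha''}$ for $\alpha' > \alpha''$ and $P_{\alpha''}(A_r) = 0$ if $A_r \in \mathcal{A}_{\alpha'}$;

(c) for every finite subset $\mathcal{F} \subseteq \mathcal{C}$, all systems of the following sequence, with
unknowns $x_r^\alpha = P_\alpha(A_r) \geq 0$, $A_r \in \mathcal{A}_\alpha$, $\alpha = 0, 1, 2, \ldots, k \leq n$, are compatible:

$$(S_\alpha) \quad \begin{cases} \sum_{A_r \subseteq E_i \wedge H_i} x_r^\alpha = P(E_i|H_i) \sum_{A_r \subseteq H_i} x_r^\alpha \\ \quad \text{[for all } E_i|H_i \in \mathcal{F} \text{ such that } \sum_{A_r \subseteq H_i} x_r^{\alpha-1} = 0, \ \alpha \geq 1\text{]} \\ \sum_{A_r \subseteq H_o^\alpha} x_r^\alpha = 1 \end{cases}$$

where $H_o^0 = H_o = H_1 \vee \ldots \vee H_n$ and $\sum_{A_r \subseteq H_i} x_r^{-1} = 0$ for all $H_i$'s, while $H_o^\alpha$
denotes, for $\alpha \geq 1$, the union of the $H_i$'s such that $\sum_{A_r \subseteq H_i} x_r^{\alpha-1} = 0$.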

Any class $\{P_\alpha\}$ singled out by the condition (b) is said to agree with the
conditional probability $P$. Notice that in general there are infinitely many classes of
probabilities $\{P_\alpha\}$; in particular we have only one agreeing class in the case
that $\mathcal{C}$ is the product $\mathcal{G} \times \mathcal{G}^o$ (with $\mathcal{G}$ a Boolean algebra).



3 Zero-Layers 

We recall now the concept of zero-layer [3], which naturally arises from the 
nontrivial structure of conditional probability brought out by Theorem 1. (We 
refer, for simplicity, only to finite sets of conditional events). 

Definition 1. Given a class $\mathcal{P} = \{P_\alpha\}$, agreeing with a conditional
probability in the sense of the characterization Theorem 1, it naturally induces the
zero-layer $o(H)$ of an event $H$, defined as

$$o(H) = \alpha \quad \text{if } P_\alpha(H) > 0,$$

and the zero-layer of a conditional event $E|H$ as

$$o(E|H) = o(E \wedge H) - o(H).$$

The zero-layers single out a partition of the family of the conditioning events:
on the other hand we may have, if $E$ is not one of them, $P_\alpha(E) = 0$ for every
$\alpha = 0, 1, 2, \ldots, k$. Nevertheless, as shown in [5], it is possible to assign to it an
arbitrary probability $P_{k+1}(E) > 0$, so that $o(E) = k + 1$.

Obviously, for the certain event $\Omega$ and for any event $E$ with positive
probability, we have $o(\Omega) = o(E) = 0$ (so that, if the class contains only an everywhere
positive probability $P_0$, there is only one (trivial) zero-layer, i.e. $\alpha = 0$), while
we put $o(\emptyset) = +\infty$. Clearly,

$$o(A \vee B) = \min\{o(A), o(B)\}.$$
Moreover, notice that $P(E|H) > 0$ if and only if $o(E \wedge H) = o(H)$, that is iff
$o(E|H) = 0$.
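
As an illustration only, the following Python sketch computes zero-layers for a small finite example. The atoms a, b, c, d, the three-layer class `LAYERS` and all function names are assumptions made for this sketch; they are not taken from [3] or [5]. Events are represented as sets of atoms.

```python
from fractions import Fraction
import math

# Hypothetical agreeing class {P_alpha} on four atoms a, b, c, d (illustrative values).
# Each layer lists only the atoms to which it gives positive probability.
LAYERS = [
    {"a": Fraction(1)},                          # P_0
    {"b": Fraction(1, 4), "c": Fraction(3, 4)},  # P_1
    {"d": Fraction(1)},                          # P_2
]

def prob(layer, event):
    """Probability of an event (a set of atoms) in one layer."""
    return sum((layer.get(atom, Fraction(0)) for atom in event), Fraction(0))

def zero_layer(event):
    """o(H): index of the first layer with P_alpha(H) > 0; +inf if there is none."""
    for alpha, layer in enumerate(LAYERS):
        if prob(layer, event) > 0:
            return alpha
    return math.inf

def zero_layer_cond(e, h):
    """o(E|H) = o(E and H) - o(H), for a possible conditioning event H."""
    return zero_layer(e & h) - zero_layer(h)

def cond_prob(e, h):
    """P(E|H), evaluated as a ratio inside the layer o(H)."""
    layer = LAYERS[zero_layer(h)]
    return prob(layer, e & h) / prob(layer, h)
```

With these assumed values, conditioning on the event {c, d} selects layer 1, where the atom d still has probability zero, so the conditional probability of {d} given {c, d} is 0 and its zero-layer is 1.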

Notice that the zero-layer (which is obviously significant mainly for events 
of zero probability) is a tool to detect “how much” a null event (conditional or 
not) is ... null. 

For the connection of the concept of zero-layer to Spohn’s ranking function, 
see the discussion in Sect. 12.3 of the book [8]. 



4 Coherent Extensions 

A fundamental result is the following, essentially due (for unconditional events, 
and referring to an equivalent form of coherence in terms of betting scheme) to 
de Finetti [12]. 

Theorem 2. Let $\mathcal{C}$ be a family of conditional events and $P$ a corresponding
assessment; then there exists a (possibly not unique) coherent extension of $P$ to
an arbitrary family $\mathcal{K} \supseteq \mathcal{C}$, if and only if $P$ is coherent on $\mathcal{C}$.

The following theorem (see [8]) shows that a coherent assignment of $P(\cdot|\cdot)$
to a family of conditional events whose conditioning ones are a partition of $\Omega$ is
essentially unbound.

Theorem 3. Let $\mathcal{C}$ be a family of conditional events $\{E_i|H_i\}_{i \in I}$, where
card($I$) is arbitrary and the events $H_i$'s are a partition of $\Omega$. Then any function
$f : \mathcal{C} \to [0, 1]$ such that

$$f(E_i|H_i) = 0 \ \text{ if } E_i \wedge H_i = \emptyset, \quad \text{and} \quad f(E_i|H_i) = 1 \ \text{ if } H_i \subseteq E_i \qquad (3)$$

is a coherent conditional probability.

Proof. Coherence follows easily from Theorem 1; in fact, for any finite subset
$\mathcal{F} \subseteq \mathcal{C}$ we must consider the relevant systems $(S_\alpha)$: each equation is
“independent” from the others, since the events $H_i$'s have no atoms in common, and so
for any choice of $P(E_i|H_i)$ each equation (and then the corresponding system)
has trivially a solution (actually, many solutions).
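
The argument of the proof can be made concrete in the finite case. The sketch below is only an illustration under stated assumptions: a common event $E$ logically independent of a finite partition $H_1, \ldots, H_n$, and arbitrary positive weights $w_i = P_0(H_i)$ (equal by default). The function names are hypothetical. Each pair of atoms $E \wedge H_i$, $E^c \wedge H_i$ receives the mass $f_i w_i$ and $(1 - f_i) w_i$, so every equation is solved separately.

```python
from fractions import Fraction

def atom_probabilities(f, weights=None):
    """Return [(P0(E & H_i), P0(E^c & H_i))] reproducing the assessment f_i = P(E|H_i)."""
    n = len(f)
    if weights is None:
        weights = [Fraction(1, n)] * n          # assumed: equal mass on every H_i
    if sum(weights) != 1 or any(w <= 0 for w in weights):
        raise ValueError("weights must be positive and sum to 1")
    if any(not 0 <= fi <= 1 for fi in f):
        raise ValueError("each f_i must lie in [0, 1]")
    return [(fi * w, (1 - fi) * w) for fi, w in zip(f, weights)]

def recovered_assessment(atoms):
    """Recompute P(E|H_i) from the atom probabilities as in formula (2)."""
    return [x / (x + xc) for x, xc in atoms]
```

Changing the weights gives a different solution of the same system, which matches the remark that there are in fact many solutions.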

Corollary 1. Let $\mathcal{C}$ be a family of conditional events $\{E|H_i\}_{i \in I}$, where
card($I$) is arbitrary and the events $H_i$'s are a partition of $\Omega$, and let $P(\cdot|\cdot)$
be a coherent conditional probability such that $P(E|H_i) \in \{0, 1\}$. Then the
following two statements are equivalent:

(i) $P(\cdot|\cdot)$ is the only coherent assessment on $\mathcal{C}$;

(ii) $H_i \wedge E = \emptyset$ for every $H_i \in \mathcal{H}_0$ and $H_i \subseteq E$ for every $H_i \in \mathcal{H}_1$,
where $\mathcal{H}_r = \{H_i : P(E|H_i) = r\}$, $r = 0, 1$.

The latter two theorems constitute the main basis for the aforementioned 
interpretation of a fuzzy subset (through the membership function) and of a 
possibility distribution in terms of coherent conditional probability. 
5 Fuzzy Sets and Possibility

We recall from [9] the following two definitions (Definitions 2 and 3 below). 

If $X$ is a (not necessarily numerical) random quantity with range $C_X$, let
$A_x$, for any $x \in C_X$, be the event $\{X = x\}$. The family $\{A_x\}_{x \in C_X}$ is obviously a
partition of the certain event $\Omega$.

Let $\varphi$ be any property related to the random quantity $X$: notice that a
property, even if expressed by a proposition, does not single out an event, since
the latter needs to be expressed by a nonambiguous statement that can be either
true or false. Consider now the event

$E_\varphi$ = “You claim $\varphi$”

and a coherent conditional probability $P(E_\varphi|A_x)$, looked on as a real function

$$\mu_{E_\varphi}(x) = P(E_\varphi|A_x)$$

defined on $C_X$.

Since the events $A_x$ are incompatible, then (by Theorem 3) every $\mu_{E_\varphi}(x)$
with values in [0, 1] is a coherent conditional probability.

Remark. Given $x_1, x_2 \in C_X$ and the corresponding conditional probabilities
$\mu_{E_\varphi}(x_1)$ and $\mu_{E_\varphi}(x_2)$, a coherent extension of $P$ to the conditional event
$E_\varphi|(A_{x_1} \vee A_{x_2})$ is not necessarily additive with respect to the conditioning events
(yet, see the Remark following Corollary 2).

Definition 2. Given a random quantity $X$ with range $C_X$ and a related
property $\varphi$, a fuzzy subset $E_\varphi^*$ of $C_X$ is the pair

$$E_\varphi^* = \{E_\varphi, \mu_{E_\varphi}\},$$

with $\mu_{E_\varphi}(x) = P(E_\varphi|A_x)$ for every $x \in C_X$.

So a coherent conditional probability $P(E_\varphi|A_x)$ is a measure of how much
You, given the event $A_x = \{X = x\}$, are willing to claim the property $\varphi$, and
it plays the role of the membership function of the fuzzy subset $E_\varphi^*$.

In [6] we have been able not only to define fuzzy subsets, but also to introduce 
in a very natural way the basic continuous T-norms and the relevant dual
T-conorms, bound to the former by coherence. In fact, given a T-norm (that in
this framework singles out the value $P(E_\varphi \wedge E_\psi | A_x \wedge A_y)$ of the conjunction),
then the corresponding choice of the T-conorm (which determines the value of 
the disjunction) is uniquely driven by the coherence of the relevant conditional 
probability (and the dual operation is what is actually obtained). 

Definition 3. Let $E$ be an arbitrary event and $P$ any coherent conditional
probability on the family $\mathcal{G} = \{E\} \times \{A_x\}_{x \in C_X}$, admitting $P(E|\Omega) = 1$ as
(coherent) extension. A distribution of possibility on $C_X$ is the real function $\pi$
defined by $\pi(x) = P(E|A_x)$.
6 Conditional Probability as a Nonadditive Uncertainty
Measure 



Actually, the previous definition can be used as well to introduce any general
distribution $\varphi$, to be called just uncertainty measure.

Definition 4. Let $E$ be an arbitrary event and $P$ any coherent conditional
probability on the family $\mathcal{G} = \{E\} \times \{A_x\}_{x \in C_X}$, admitting $P(E|\Omega) = 1$ as
(coherent) extension. A distribution of uncertainty measure on $C_X$ is the real
function $\varphi$ defined by $\varphi(x) = P(E|A_x)$.

Remark. When $C_X$ is finite, since every extension of $P(E|\cdot)$ must satisfy
axioms (i), (ii) and (iii) of a conditional probability, condition $P(E|\Omega) = 1$ gives

$$P(E|\Omega) = \sum_{x \in C_X} P(A_x|\Omega) P(E|A_x) \quad \text{and} \quad \sum_{x \in C_X} P(A_x|\Omega) = 1.$$

Then

$$1 = P(E|\Omega) \leq \max_{x \in C_X} P(E|A_x);$$

therefore $P(E|A_x) = 1$ for at least one event $A_x$.

On the other hand, we notice that in our framework (where null probabilities
for possible conditioning events are allowed) it does not necessarily follow that
$P(E|A_x) = 1$ for every $x$; in fact we may well have $P(E|A_y) = 0$ (or else equal
to any other number between 0 and 1) for some $y \in C_X$.

Obviously, the constraint $P(E|A_x) = 1$ for some $x$ is not necessary when the
cardinality of $C_X$ is infinite.



We will study now the properties of coherent extensions of the function $\varphi$,
seen as coherent conditional probability $P(E|\cdot)$, to the algebra spanned by the
events $A_x$.

First of all we note that if $\mathcal{A}$ is an algebra, $E$ an arbitrary event and $P(E|\cdot)$
a coherent conditional probability on $\{E\} \times \mathcal{A}^o$, then the function $f(\cdot)$ defined on
$\mathcal{A}$ by putting $f(\emptyset) = 0$ and $f(A) = P(E|A)$ is not necessarily additive. On the
contrary (as we will see in the following Theorem 4) $f$ is necessarily sub-additive
(0-alternating).

Lemma 1. Let $\mathcal{C}$ be a family of conditional events $\{E|H_i\}_{i \in I}$, where card($I$)
is arbitrary and the events $H_i$'s are a partition of $\Omega$, and let $P(E|\cdot)$ be an arbitrary
(coherent) conditional probability on $\mathcal{C}$. Then any coherent extension of $P$ to
$\mathcal{C}' = \{E|H : H \in \mathcal{H}^o\}$, where $\mathcal{H}$ is the algebra spanned by the $H_i$, is such that, for
every $H, K \in \mathcal{H}$, with $H \wedge K = \emptyset$,

$$\min\{P(E|H), P(E|K)\} \leq P(E|H \vee K) \leq \max\{P(E|H), P(E|K)\}.$$



Proof. By Theorem 2, $P$ can be extended to a coherent conditional probability
on $\mathcal{C}'$, and the latter in turn can be extended to a coherent conditional probability
on $\mathcal{C}'' = \mathcal{C}' \cup \{H|K : H, K \in \mathcal{H}^o\}$. This satisfies, by axiom (iii) of a conditional
probability,

$$P(E|H \vee K) = P(E|H) P(H|H \vee K) + P(E|K) P(K|H \vee K),$$

for every $H \wedge K = \emptyset$. The conclusion follows, since - by axioms (i) and (ii) -
we have $P(H|H \vee K) + P(K|H \vee K) = 1$.
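
The convex-combination step of the proof can be checked numerically. In the sketch below the function name and the parameter `weight_h`, standing for $P(H|H \vee K)$, are labels chosen here for illustration; the code only evaluates the displayed identity for disjoint $H$ and $K$.

```python
from fractions import Fraction

def extend_to_disjunction(p_h, p_k, weight_h):
    """P(E|H v K) for disjoint H, K, given p_h = P(E|H), p_k = P(E|K)
    and weight_h = P(H|H v K); then P(K|H v K) = 1 - weight_h."""
    if not 0 <= weight_h <= 1:
        raise ValueError("weight_h must lie in [0, 1]")
    return p_h * weight_h + p_k * (1 - weight_h)
```

For instance, with $P(E|H) = 1/2$, $P(E|K) = 1/3$ and $P(H|H \vee K) = 1/2$ the extension takes the value 5/12, which lies between the two given values as the lemma requires.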

Corollary 2. Let $\mathcal{C}$ be a family of conditional events $\{E|H_i\}_{i \in I}$, where
card($I$) is arbitrary and the events $H_i$'s are a partition of $\Omega$, and let $P(E|\cdot)$ be
an arbitrary (coherent) conditional probability on $\mathcal{C}$. Then any coherent
extension of $P$ to $\mathcal{C}' = \{E|H : H \in \mathcal{H}^o\}$, where $\mathcal{H}$ is the algebra spanned by the $H_i$, is
such that, for every $H, K \in \mathcal{H}$, with $H \wedge K = \emptyset$,

$$P(E|H \vee K) \leq P(E|H) + P(E|K).$$

Proof. The proof is a trivial consequence of Lemma 1.

Remark. Can equality hold in the previous relation? If this happens, it
means (by Lemma 1) that either $P(E|H)$ or $P(E|K)$ is equal to zero, so that, if
$P(E|\Omega) > 0$, an easy induction leads to the conclusion that for all elements of
the partition except one we have $P(E|H_i) = 0$.

Recalling that a function $f$ is 2-alternating if, for every $H, K$, we have

$$f(H \vee K) \leq f(H) + f(K) - f(H \wedge K),$$

the following example shows the existence of coherent extensions of the
conditional probability $P$ to $\mathcal{C}' = \{E|H : H \in \mathcal{H}^o\}$ which are not 2-alternating
measures.

Example. Let $H_1, H_2, H_3$ be a partition of $\Omega$ and $E$ an event logically
independent of the $H_i$'s. Consider the following assessment:

$$P(E|H_1) = 1/2, \quad P(E|H_2) = 1/3, \quad P(E|H_3) = 1,$$

$$P(E|H_1 \vee H_2) = 5/12, \quad P(E|H_1 \vee H_3) = 1, \quad P(E|H_1 \vee H_2 \vee H_3) = P(E|\Omega) = 1.$$

This assessment is not 2-alternating, in fact we have

$$1 = P(E|H_1 \vee H_2 \vee H_3) > P(E|H_1 \vee H_2) + P(E|H_1 \vee H_3) - P(E|H_1) = 11/12.$$

Nevertheless the assessment is a coherent conditional probability (and so admits
a coherent extension to $\mathcal{C}'$). By condition (c) of Theorem 1, to prove coherence
consider the probabilities of the relevant atoms $x_i = P_0(E \wedge H_i)$ and $x_i' =
P_0(E^c \wedge H_i)$ as unknowns in the following system:

$$\begin{cases} x_1 = \frac{1}{2}(x_1 + x_1') \\ x_2 = \frac{1}{3}(x_2 + x_2') \\ x_3 = (x_3 + x_3') \\ x_1 + x_2 = \frac{5}{12}(x_1 + x_2 + x_1' + x_2') \\ x_1 + x_3 = (x_1 + x_3 + x_1' + x_3') \\ x_1 + x_2 + x_3 = (x_1 + x_2 + x_3 + x_1' + x_2' + x_3') \\ \sum_{k=1}^{3} (x_k + x_k') = 1 \\ x_k \geq 0 \end{cases}$$

which has the solution $x_3 = 1$, $x_1 = x_2 = x_i' = 0$ ($i = 1, 2, 3$).
Putting now $y_i = P_1(E \wedge H_i)$ and $y_i' = P_1(E^c \wedge H_i)$ ($i = 1, 2$), we need to
consider the second system

$$\begin{cases} y_1 = \frac{1}{2}(y_1 + y_1') \\ y_2 = \frac{1}{3}(y_2 + y_2') \\ y_1 + y_2 = \frac{5}{12}(y_1 + y_2 + y_1' + y_2') \\ \sum_{k=1}^{2} (y_k + y_k') = 1 \\ y_k \geq 0 \end{cases}$$

which has the solution $y_1 = y_1' = \frac{1}{4}$, $y_2 = \frac{1}{6}$, $y_2' = \frac{1}{3}$.

We note that this 0-alternating (but not 2-alternating) function $P(E|\cdot)$ is
not monotone with respect to $\subseteq$, i.e. there exist $H, K \in \mathcal{H}$, with $H \subseteq K$, such
that $P(E|H) > P(E|K)$: just take $H = H_1$ and $K = H_1 \vee H_2$.
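
The arithmetic of the Example can be re-checked with exact fractions. The sketch below is only a verification aid: conditioning events are encoded (an assumption of this sketch) as sets of indices of the $H_i$, and each layer stores the pair $(P_\alpha(E \wedge H_i), P_\alpha(E^c \wedge H_i))$ obtained by solving the two systems.

```python
from fractions import Fraction as F

# The assessment P(E|H), with H encoded as the set of indices i of the H_i it contains.
P = {
    frozenset({1}): F(1, 2),
    frozenset({2}): F(1, 3),
    frozenset({3}): F(1),
    frozenset({1, 2}): F(5, 12),
    frozenset({1, 3}): F(1),
    frozenset({1, 2, 3}): F(1),
}

# Atom probabilities (E & H_i, E^c & H_i) in layer 0 and in layer 1.
LAYER0 = {1: (F(0), F(0)), 2: (F(0), F(0)), 3: (F(1), F(0))}
LAYER1 = {1: (F(1, 4), F(1, 4)), 2: (F(1, 6), F(1, 3))}

def satisfies(layer, h):
    """Check the equation for E|H: mass of E & H equals P(E|H) times mass of H."""
    num = sum(layer[i][0] for i in h)
    den = sum(layer[i][0] + layer[i][1] for i in h)
    return num == P[h] * den

def two_alternating_gap():
    """P(E|Omega) - [P(E|H1 v H2) + P(E|H1 v H3) - P(E|H1)]; positive means violation."""
    return (P[frozenset({1, 2, 3})]
            - (P[frozenset({1, 2})] + P[frozenset({1, 3})] - P[frozenset({1})]))
```

The first layer satisfies all six equations, the second layer satisfies those involving only $H_1$ and $H_2$, and the gap evaluates to 1/12, confirming both coherence and the failure of the 2-alternating inequality.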

So the function $f(H) = P(E|H)$, with $P$ a coherent conditional probability,
in general is not a capacity.

The following theorem focuses on the condition that assures the monotonicity
of $P(E|\cdot)$.

Theorem 4. Let $\mathcal{C}$ be a family of conditional events $\{E|H_i\}_{i \in I}$, where
card($I$) is arbitrary and the events $H_i$'s are a partition of $\Omega$, and let $P(E|\cdot)$ be
an arbitrary (coherent) conditional probability on $\mathcal{C}$. A coherent extension of $P$
to $\mathcal{C}' = \{E|H : H \in \mathcal{H}^o\}$, where $\mathcal{H}$ is the algebra spanned by the $H_i$, is monotone
with respect to $\subseteq$ if and only if for every $H, K \in \mathcal{H}$

$$P(E|H \vee K) = \max\{P(E|H), P(E|K)\}.$$

Proof. The proof is a direct consequence of Lemma 1.

The question now is: are there coherent conditional probabilities $P(E|\cdot)$
monotone with respect to $\subseteq$?

We will reach a positive answer by means of the following theorem (given 
in [9]), which represents the main tool to introduce possibility functions in our 
context referring to coherent conditional probabilities. 

Theorem 5. Let $E$ be an arbitrary event and $\mathcal{C}$ be the family of conditional
events $\{E|H_i\}_{i \in I}$, where card($I$) is arbitrary and the events $H_i$'s are a partition
of $\Omega$. Denote by $\mathcal{H}$ the algebra spanned by the $H_i$'s and let $f : \mathcal{C} \to [0, 1]$ be
any function such that (3) holds (with $E_i = E$ for every $i \in I$). Then any $P$
extending $f$ on $\mathcal{K} = \{E\} \times \mathcal{H}^o$ and such that

$$P(E|H \vee K) = \max\{P(E|H), P(E|K)\}, \quad \text{for every } H, K \in \mathcal{H}^o, \qquad (4)$$

is a coherent conditional probability.
Remark. If card($I$) is infinite, there exists a coherent extension of the
function $f$ as conditional probability on $\mathcal{K} = \{E\} \times \mathcal{H}^o$, satisfying (4), even if we
put the further constraint $P(E|\Omega) = 1$. (Recall that the ensuing assessment
$P(E) = 1$ by no means implies $E = \Omega$). On the other hand, when card($I$) is
finite, this extension is possible only if $P(E|H_i) = 1$ for some $i$.

Remark. If card($I$) is finite, then for every $H \in \mathcal{H}^o$,

$$P(E|H) = \max_{H_i \subseteq H} P(E|H_i).$$

So the knowledge of the function $P$ on the given partition is enough to determine
the whole conditional probability on $\mathcal{H}^o$. On the other hand, in the general case,
if the event $H$ is an infinite disjunction of elements of $\mathcal{H}^o$, it is not necessarily
true that the conditional probability $P(E|H)$ is the supremum of the $P(E|H_i)$'s.
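
For the finite case, the max rule of this Remark is easy to state in code. The following sketch assumes a finite partition whose elements are identified with their labels and a dictionary `pi` holding the values $P(E|H_i)$; the helper names are hypothetical. It also checks monotonicity with respect to inclusion, which is the property that (4) is meant to secure.

```python
from itertools import combinations

def possibility(pi, event):
    """P(E|H) = max of P(E|H_i) over the H_i contained in H; 0 for the empty event."""
    return max((pi[x] for x in event), default=0)

def all_events(labels):
    """All subsets of the finite partition, i.e. the algebra it spans."""
    labels = list(labels)
    return [frozenset(c) for r in range(len(labels) + 1)
            for c in combinations(labels, r)]

def is_monotone(pi):
    """True if H subset of K implies P(E|H) <= P(E|K) for every pair of events."""
    events = all_events(pi)
    return all(possibility(pi, h) <= possibility(pi, k)
               for h in events for k in events if h <= k)
```

Note that, in line with the preceding Remark, a finite example consistent with $P(E|\Omega) = 1$ needs at least one partition element with value 1.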

The above result allowed us to introduce a “convincing” (from our point of 
view) definition of possibility measure. On the other hand, for a classic approach, 
see the book [14]. 

Definition 5. Let $\mathcal{H}$ be an algebra of subsets of $C_X$ (the range of a
random quantity $X$) and $E$ an arbitrary event. If $P$ is any coherent conditional
probability on $\mathcal{K} = \{E\} \times \mathcal{H}^o$, with $P(E|\Omega) = 1$ and such that

$$P(E|H \vee K) = \max\{P(E|H), P(E|K)\}, \quad \text{for every } H, K \in \mathcal{H}^o,$$

then a possibility measure on $\mathcal{H}$ is a real function $\Pi$ defined by $\Pi(H) = P(E|H)$
for $H \in \mathcal{H}^o$ and $\Pi(\emptyset) = 0$.

Remark. Theorem 5 assures (in our context) that any possibility measure
can be obtained as coherent extension (unique, in the finite case) of a possibility
distribution. Vice versa, given any possibility measure $\Pi$ on an algebra $\mathcal{H}$, there
exists an event $E$ and a coherent conditional probability $P$ on $\mathcal{K} = \{E\} \times \mathcal{H}^o$
agreeing with $\Pi$, i.e. whose extension to $\{E\} \times \mathcal{H}$ (putting $P(E|\emptyset) = 0$) coincides
with $\Pi$.

The following Theorem 6, which is an immediate consequence of Theorems 
4 and 5, is the main result of this paper. 

Theorem 6. Let $E$ be an arbitrary event and $\mathcal{C}$ be the family of conditional
events $\{E|H_i\}_{i \in I}$, where card($I$) is arbitrary and the events $H_i$'s are a partition
of $\Omega$. Denote by $\mathcal{H}$ the algebra spanned by the $H_i$'s and let $f : \mathcal{C} \to [0, 1]$ be any
function such that (3) holds (with $E_i = E$ for every $i \in I$). Then any coherent $P$
extending $f$ on $\mathcal{K} = \{E\} \times \mathcal{H}^o$ is a capacity if and only if it is a possibility.

7 Conclusions 

Going back to our interpretation of a membership function $\mu(x)$ through a
suitable coherent conditional probability (a measure of how much You, given the
event $A_x = \{X = x\}$, are willing to claim the relevant property $\varphi$), and putting

$$H_o = \{x \in C_X : \mu(x) = 0\}, \quad H_1 = \{x \in C_X : \mu(x) = 1\},$$

the conditional probability $P(E|H^c)$, with $H = H_o \vee H_1$, is a measure of how
much You are willing to claim property $\varphi$ if the only fact you know is that
$X \in H^c$. And this will is “independent” of your beliefs corresponding to the single
values $x$: in fact, even if “$H$ false” corresponds to the truth of $\{\bigvee_x A_x : x \in H^c\}$,
nevertheless there is no additivity requirement, since conditional probability is
not additive with respect to the disjunction of conditioning events.

On the other hand, every membership function can be regarded as a possibility
distribution. If $\mathcal{A}$ is an algebra of subsets of $C_X$, the ensuing possibility measure
can be interpreted in the following way: it is a sort of “global” membership
(relative to each finite $A \in \mathcal{A}$) which takes, among all the possible choices for
its value on $A$, i.e. among all possible extensions satisfying (4), the maximum
of the membership in $A$. Moreover, we proved in [9] that we can regard every
possibility measure $\Pi$ as a decreasing function of the elements of the zero-layer
set $\{0, 1, 2, \ldots, k\}$ generated by the coherent conditional probability $P$ agreeing
with $\Pi$ (in the sense of the last Remark of Sect. 3). In conclusion, the coherent
extensions of a conditional probability $P(E|A_x)$ that satisfy (4) give
rise to different zero-layers for the atoms $A_x$ corresponding to different
$P(E|A_x)$, so that such a coherent conditional probability $P(E|\cdot)$ can be suitably
associated to a measure of your “disbelief” in the events $A \in \mathcal{A}$.

Then some of the above conclusions may appear counterintuitive (for example,
considering the statement “Mary is young”, if you know that Mary’s age is
$x = 39$, you may be willing to put $\mu(x) = .2$, while if you know that her age
is $y = 26$, you may be willing to put $\mu(y) = .9$; then, knowing instead that
her age is between 26 and 39, the corresponding possibility is still .9): in fact,
the “global” membership should possibly decrease when the information is not
concentrated on a given $x$, but is “spread” over a larger set. So our results may
suggest taking as such global measure a function which is not a capacity, yet
satisfying the weaker conditions of Lemma 1.
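
The “Mary is young” example can be put in numbers. The sketch below is a simplification made for illustration: it keeps only the two ages mentioned in the text (the conditioning event is “age is 26 or 39” rather than the whole interval), and the weights passed to `convex_choice`, standing for the conditional probabilities of the two ages given their disjunction, are hypothetical. It contrasts the possibility choice (the maximum) with another value allowed by Lemma 1.

```python
from fractions import Fraction

# Memberships stated in the text for the two ages.
MEMBERSHIP = {39: Fraction(2, 10), 26: Fraction(9, 10)}

def lemma1_bounds(membership):
    """Interval in which Lemma 1 confines the value on the disjunction."""
    values = list(membership.values())
    return min(values), max(values)

def possibility_choice(membership):
    """Value selected by condition (4): the maximum membership."""
    return max(membership.values())

def convex_choice(membership, weights):
    """A different coherent value: convex combination with assumed weights summing to 1."""
    if sum(weights.values()) != 1 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be nonnegative and sum to 1")
    return sum(membership[x] * weights[x] for x in membership)
```

With equal weights the convex choice is 11/20, a global membership that does decrease when the information is spread over both ages, while still respecting the bounds of Lemma 1.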

In a forthcoming paper [10] we will deepen these aspects, looking on a coher- 
ent conditional probability as a suitable (anti-monotone) information measure. 
Abstract. In this paper, we apply decision trees (DT) to intrusion
detection problems. Experiments are carried out on the KDD’99 datasets. These
data offer the main features needed to evaluate intrusion detection systems.
We consider three levels of attack granularity, depending on whether we are
dealing with all attacks, grouping them in special categories, or just
focusing on normal and abnormal behaviours. We also extend the classification
procedure to handle uncertain observations encountered in connection
features. To this end, uncertainty is represented by possibility
distributions and the inference in DT is based on qualitative possibilistic
logic.



1 Introduction 

Decision trees [2,15,11,13] are among the most commonly used classification
methods in supervised learning. Standard decision trees can only deal with
instances where all attributes are precisely defined. They are thus
inappropriate for classifying instances with uncertain attributes. Ignoring this
uncertainty can degrade the obtained results.

This paper proposes an extension of the inference method for classifying 
new instances containing uncertain attributes. This uncertainty is handled in a 
possibilistic logic framework. 

An application to the intrusion detection problem in the context of
information systems using classical decision trees is detailed. Our aim is not only to
evaluate the capacity of DT in the classification problem, but also to see how
appropriate decision trees are for an intrusion detection problem. Different
experiments are performed on the KDD’99 datasets. These data are
appropriate to evaluate an intrusion detection system. Indeed, the different
attacks are not equally represented. For instance, in the training set, there
are only 0.01% of U2R attacks, while DOS attacks are represented by 79.24%.
Moreover, there are attacks which are present in the testing set but not in the
training set. Hence, these data can be used to check the capability of an
intrusion detection system to detect new attacks. In this paper, experiments are
performed according to three levels of attack granularity, depending on whether we are
dealing with all attacks, grouping them in special categories, or just focusing
on normal and abnormal behaviours. Then, an illustrative example of the use of
qualitative decision trees in intrusion detection problems is presented.

This paper is organized as follows: Sect. 2 provides a brief background on
decision trees. Section 3 presents an extension of the classification procedure
when testing instances contain uncertain attributes represented by qualitative
possibility distributions. Section 4 presents intrusion detection problems. Then,
Sect. 5 experiments with the use of decision trees in intrusion detection, with a deep
analysis of the normal behaviour. Finally, Sect. 6 illustrates the use of qualitative
decision trees in this field.

2 Basics of Decision Trees 

Decision trees are especially used in artificial intelligence due to their ability to 
express classification knowledge in a formalism easy to interpret. Decision trees 
present a system using a top-down strategy based on the divide and conquer 
approach where the major aim is to partition the tree in many mutually exclusive 
subsets. Each subset partition corresponds to a classification sub-problem. A 
decision tree is composed of three basic elements: 

- decision nodes specifying the test attribute, 

- edges corresponding to one of the possible values of the test attribute outcomes, 

- leaves including objects that, typically, belong to the same class. 

Two major procedures should be ensured with decision trees: 

1. Building the Tree. Based on a given training set, a decision tree is built. 
Under an attribute selection measure a test attribute should be chosen at each 
decision node allowing to diminish as much as possible the mixture of classes be- 
tween each subset created by the test. In other words, the main idea is therefore 
to find the test attribute in order to get disjoint data facilitating the determi- 
nation of objects’ classes. This process will continue for each sub decision tree 
until reaching leaves and fixing their corresponding classes. 

2. Classification. Once the tree is constructed, it is used in order to classify
a new instance. We start at the root of the decision tree and test the attribute
specified by this node. The result of this test allows us to move down the tree
branch according to the attribute value of the given instance. This process is
repeated until a leaf is encountered; the instance is then classified in the
same class as the one characterizing the reached leaf.
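
The classification procedure just described can be sketched in a few lines of Python. The nested-dictionary encoding of the tree and the toy attributes (`protocol`, `flag`) and classes (`normal`, `dos`) are assumptions made for this illustration; they are not the trees built in the experiments.

```python
def classify(tree, instance):
    """Walk from the root to a leaf, following the branch matching each tested attribute."""
    node = tree
    while isinstance(node, dict):             # decision node
        value = instance[node["attribute"]]   # test the attribute of this node
        node = node["branches"][value]        # move down the matching edge
    return node                               # a leaf holds the class label

# Hypothetical tree: test "protocol" first, then "flag" on the tcp branch.
TOY_TREE = {
    "attribute": "protocol",
    "branches": {
        "udp": "normal",
        "tcp": {
            "attribute": "flag",
            "branches": {"SF": "normal", "S0": "dos"},
        },
    },
}
```

This standard procedure presupposes that every tested attribute has exactly one known value, which is precisely the assumption dropped when attributes are uncertain.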

A generic decision tree algorithm is characterized by the next properties: 

- The Attribute Selection Measure: The idea is to use an attribute selection
measure taking into account the discriminative power of each attribute over
classes. The selection measure is generally based on information theory; we
can for instance mention those suggested by Quinlan: the information gain [11]
and the gain ratio [13]. More details concerning the different attribute selection
measures can be found in [14].

- The Partitioning Strategy: The current training set will be divided by
taking into account the selected test attribute. When we deal with discrete at- 
tributes, it consists in testing all the possible attribute values. However, a dis- 
cretization step is generally needed [16] in the case of numeric attributes. 

- The Stopping Criteria: They determine whether or not a training subset 
will be further divided. It is generally fulfilled when all the remaining objects 
belong to only one class. Therefore, the part of the decision tree verifying this 
criterion will be declared as a leaf. 
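
As a rough illustration of the two selection measures mentioned above, the following sketch computes the information gain and the gain ratio of a discrete attribute in their usual textbook form (entropy in bits, gain ratio as gain divided by the split information). The data layout, a list of dictionaries plus a parallel list of class labels, is an assumption of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the classes minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attribute):
    """Information gain normalized by the entropy of the attribute values (split information)."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:          # attribute takes a single value: no useful split
        return 0.0
    return information_gain(rows, labels, attribute) / split_info
```

At each decision node the attribute with the highest value of the chosen measure would be selected as the test attribute, and the stopping criterion corresponds to a subset whose class entropy is zero.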

3 Qualitative Possibilistic Inference 

Standard versions of decision trees are inadequate for classification in an
uncertain environment, where the correctness of classification results may be
affected. Indeed, existing approaches [3,5,12], which to some extent allow
dealing with missing or uncertain data, always classify an uncertain
instance in a unique class. This can lead to an arbitrary choice in ignorance
situations. Thus, the idea in this section is to develop a qualitative possibilistic
inference procedure adapting decision trees to classify instances characterized by
uncertain attributes. This uncertainty is represented by means of qualitative
possibility distributions.



3.1 A Brief Refresher on Qualitative Possibilistic Logic 

This sub-section briefly introduces possibilistic logic (see [4] for more details),
which is an extension of classical logic to deal with uncertain information.
Uncertainty is here assumed to be represented qualitatively by a finite and totally
ordered scale denoted by $L = \{1, \alpha_1, \ldots, \alpha_n, 0\}$ such that
$1 > \alpha_1 > \ldots > \alpha_n > 0$.
The basic concept of qualitative possibilistic logic is the notion of Qualitative
Possibility Distribution (QPD), denoted by $\pi$.

A QPD $\pi$ is a function which associates to each element $\omega$ of the universe of
discourse $\Omega$, here a set of interpretations of a propositional language, an element
from $L$ ($\pi$ encodes our beliefs on the real world). By convention, $\pi(\omega) = 1$ means
that it is completely possible that $\omega$ is the real world, $\pi(\omega) = 0$ means that $\omega$
cannot be the real world, and $\pi(\omega) \geq \pi(\omega')$ means that $\omega$ is at least as possible
as $\omega'$ to be the real world. A QPD $\pi$ is said to be normalized if there exists $\omega$
such that $\pi(\omega) = 1$.

At the syntactic level, uncertain information is represented by means of
a qualitative possibilistic knowledge base. A qualitative possibilistic knowledge
base (KB) $\Sigma$ is a set of weighted formulas of the form $(\phi_i, \alpha_i)$, where $\phi_i$ is
a propositional formula and $\alpha_i \in L$ is its uncertainty degree, which estimates
to what extent it is certain that $\phi_i$ is true considering the available data. Each
qualitative possibilistic knowledge base $\Sigma$ induces a unique QPD of the form:

$$\forall \omega \in \Omega, \quad \pi_\Sigma(\omega) = \begin{cases} 1 & \text{if } \forall (\phi_i, \alpha_i) \in \Sigma,\ \omega \models \phi_i \\ \min\{Ne(\alpha_i) : (\phi_i, \alpha_i) \in \Sigma,\ \omega \not\models \phi_i\} & \text{otherwise,} \end{cases}$$

where $Ne$ is a reversing scale function, namely:
$Ne(\alpha_i) \in L$, $Ne(1) = 0$, $Ne(0) = 1$, $Ne(Ne(\alpha_i)) = \alpha_i$ and
$Ne(\alpha_i) > Ne(\alpha_j)$ iff $\alpha_i < \alpha_j$.

The syntactic inference in possibilistic logic is achieved using the following
resolution principle:

$$\frac{(p \lor q, \alpha_i) \qquad (\neg q \lor r, \alpha_j)}{(p \lor r, \min(\alpha_i, \alpha_j))}$$

Then, to check if a conclusion $\psi$ is a possibilistic consequence of a possibilistic
knowledge base $\Sigma$, denoted by $\Sigma \vdash_\pi \psi$, we proceed by refutation in the
following way. First, we add $(\neg\psi, 1)$ to the knowledge base $\Sigma$. Then we compute, using
the possibilistic resolution rule above, the highest degree associated with the
contradiction $\bot$. This degree is the certainty degree associated with $\psi$ from $\Sigma$ [4].



3.2 Qualitative Decision Tree Inference for Uncertain Observations 



This sub-section shows how qualitative possibilistic logic [4] can be used for the 
inference in decision trees in presence of uncertain observations. 

Standard decision trees only deal with completely certain information. 
Namely, all attributes uniquely determine the class to which the instance be- 
longs, since there exists exactly one path, from the root node to the leaf class, 
which is applicable. 

Nevertheless, in practice, attributes are not always precisely defined. Let
$A_1, \ldots, A_n$ be the different attributes of the problem. The instance to classify is
described by a vector of possibility distributions $\pi = (\pi_{A_1}, \ldots, \pi_{A_n})$. An
attribute $A_i$ is precisely defined if there exists exactly one value $a \in D_{A_i}$ such
that $\pi_{A_i}(a) = 1$, and for all other values $a' \neq a$, $\pi_{A_i}(a') = 0$. A missing value
for an attribute $A_i$ is represented by a uniform possibility distribution
$\pi_{A_i}$ (i.e., $\forall a \in D_{A_i}$, $\pi_{A_i}(a) = 1$).

At the semantic level, handling uncertain observations in possibilistic logic
is achieved by combining conjunctively (using the minimum operator) the
possibility distributions $\pi_{A_i}$ with the possibility distribution $\pi_\Sigma$ associated with
the decision tree (which represents the knowledge base). The possibility
distribution $\pi_\Sigma$ is a two-level one, i.e., each interpretation has either value 0 or
value 1. Let $a_1 \land \ldots \land a_n \land c_i$ be a given interpretation, then:

$$\pi_\Sigma(a_1 \land \ldots \land a_n \land c_i) = \begin{cases} 1 & \text{if there exists a path from the root to } c_i \text{ using } \{a_1 \land \ldots \land a_n\} \\ 0 & \text{otherwise.} \end{cases}$$

Then, the selection of the class(es) to which the instance $\pi = (\pi_{A_1}, \ldots, \pi_{A_n})$
belongs is defined in two steps:

- Combine $\pi_{A_1}, \ldots, \pi_{A_n}$ and $\pi_\Sigma$ in a conjunctive way, namely compute:

$$\pi_{Result} = \min(\pi_{A_1}, \ldots, \pi_{A_n}, \pi_\Sigma).$$

- Select the class(es) $c_i$ having the highest possibility degree in $\pi_{Result}$.
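
The two steps above can be sketched in a few lines. The sketch makes several assumptions that are not in the paper: the qualitative scale is replaced by numbers in [0, 1], and the tree (`PATHS`), the domains and the instances are hypothetical toy data.

```python
from itertools import product

# Hypothetical toy tree, written as (conditions along a path, class at the leaf).
PATHS = [
    ({"A": "a1", "B": "b1"}, "c1"),
    ({"A": "a1", "B": "b2"}, "c2"),
    ({"A": "a2"}, "c2"),
]
DOMAINS = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
CLASSES = ["c1", "c2"]


def pi_tree(assignment, cls):
    """Two-level distribution of the tree: 1 if a path matching the assignment ends in cls."""
    matches = any(
        label == cls and all(assignment[a] == v for a, v in conds.items())
        for conds, label in PATHS
    )
    return 1.0 if matches else 0.0


def pi_result(instance):
    """For each class, the highest degree of min(pi_A1, ..., pi_An, pi_tree)."""
    degrees = {c: 0.0 for c in CLASSES}
    names = list(DOMAINS)
    for values in product(*(DOMAINS[n] for n in names)):
        assignment = dict(zip(names, values))
        observed = min(instance[n][assignment[n]] for n in names)
        for c in CLASSES:
            degrees[c] = max(degrees[c], min(observed, pi_tree(assignment, c)))
    return degrees


def select_classes(instance):
    """Return every class reaching the highest possibility degree."""
    degrees = pi_result(instance)
    best = max(degrees.values())
    return sorted(c for c, d in degrees.items() if d == best)


precise = {"A": {"a1": 1.0, "a2": 0.0}, "B": {"b1": 1.0, "b2": 0.0}}
missing_b = {"A": {"a1": 1.0, "a2": 0.0}, "B": {"b1": 1.0, "b2": 1.0}}

print(select_classes(precise))    # ['c1']
print(select_classes(missing_b))  # ['c1', 'c2']
```

When B is missing (uniform distribution), both classes reach degree 1 and both are returned, instead of one of them being picked arbitrarily.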

At the syntactic level, the uncertainty regarding an attribute $A_i$ can also
be represented by a possibilistic knowledge base $\Sigma_{A_i}$. Of course, $\Sigma_{A_i}$ should be
associated with $\pi_{A_i}$ using the definition of Sect. 3.1. Note that standard possibilistic
logic only deals with binary variables, while the attributes used in decision
trees are not necessarily binary. In this case, the introduction of domain
exclusion constraints is necessary. For instance, let us assume that the
domain associated with an attribute $A_i$ is $\{b_1, b_2, b_3\}$. Assume that the available
information about $A_i$ is: the value of $A_i$ can be either $b_1$ or $b_2$ (namely, $b_3$ is
excluded), and $b_1$ is more plausible (with a degree $\alpha$). These pieces of information
are represented by:

$$\Sigma_{A_i} = \{(\neg b_1 \lor \neg b_2, 1), (\neg b_1 \lor \neg b_3, 1), (\neg b_2 \lor \neg b_3, 1), (b_1 \lor b_2 \lor b_3, 1), (b_1 \lor b_2, 1), (b_1, \alpha)\}.$$

The first 4 formulas represent the domain exclusion constraints while the last
2 formulas encode the available information on $A_i$. It can be checked that $\Sigma_{A_i}$
can be equivalently rewritten as $\Sigma_{A_i} = \{(\neg b_3, 1), (\neg b_2, \alpha), (b_1 \lor b_2, 1), (b_1, \alpha)\}$,
and that $\pi_{A_i}(b_1) = 1$, $\pi_{A_i}(b_2) = 1 - \alpha$, $\pi_{A_i}(b_3) = 0$.
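
The distribution claimed for this example can be checked mechanically by applying the definition of Sect. 3.1 to each admissible value. The sketch below is only a sanity check under stated assumptions: the scale is numeric, $Ne(x) = 1 - x$ is used as the reversing function, and `ALPHA = 0.25` is an arbitrary stand-in for the degree $\alpha$.

```python
ALPHA = 0.25  # hypothetical numeric stand-in for the degree alpha


def ne(degree):
    """Order-reversing map on the scale; 1 - x is one possible numeric choice."""
    return 1.0 - degree


# A clause is a list of (variable, polarity) literals; a KB is a list of (clause, weight).
SIGMA_AI = [
    ([("b1", False), ("b2", False)], 1.0),
    ([("b1", False), ("b3", False)], 1.0),
    ([("b2", False), ("b3", False)], 1.0),
    ([("b1", True), ("b2", True), ("b3", True)], 1.0),
    ([("b1", True), ("b2", True)], 1.0),
    ([("b1", True)], ALPHA),
]
SIGMA_AI_REWRITTEN = [
    ([("b3", False)], 1.0),
    ([("b2", False)], ALPHA),
    ([("b1", True), ("b2", True)], 1.0),
    ([("b1", True)], ALPHA),
]


def satisfies(world, clause):
    """A world satisfies a clause if it agrees with at least one literal."""
    return any(world[var] == polarity for var, polarity in clause)


def possibility(world, kb):
    """1 if no formula is violated, else the minimum of Ne(weight) over violated formulas."""
    violated = [ne(weight) for clause, weight in kb if not satisfies(world, clause)]
    return min(violated) if violated else 1.0


def distribution_over_values(kb, values=("b1", "b2", "b3")):
    """Possibility of each world in which exactly one domain value holds."""
    return {v: possibility({u: (u == v) for u in values}, kb) for v in values}


print(distribution_over_values(SIGMA_AI))
print(distribution_over_values(SIGMA_AI_REWRITTEN))
```

Both knowledge bases yield degree 1 for $b_1$, $1 - \alpha$ for $b_2$ and 0 for $b_3$. The comparison is restricted to worlds where exactly one value holds, since the rewritten base relies on the exclusion constraints.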

The decision tree can also be immediately represented by a possibilistic logic
base $\Sigma$ in the following way: if $(a_1(\text{root}), \ldots, a_n, c)$ is a path, then we add to the
knowledge base the rule $(a_1 \land \ldots \land a_n \Rightarrow c, 1)$.

Now, we can check that, given $\Sigma_{A_1}, \ldots, \Sigma_{A_n}$ and $\Sigma$, the possibilistic knowledge
base associated to $\pi_{Result}$ is simply the union of knowledge bases $\Sigma_{A_1} \cup \ldots \cup \Sigma_{A_n} \cup \Sigma$.
Therefore, an instance $\pi = (\pi_{A_1}, \ldots, \pi_{A_n})$ belongs to a class $c$ if $c$ is a possibilistic
consequence of $\Sigma_{A_1} \cup \ldots \cup \Sigma_{A_n} \cup \Sigma$.

Qualitative decision tree inference is useful in almost all fields where decision
trees are applied and uncertainty arises. An illustrative
example on intrusion detection systems will be presented in Sect. 6. But first we
briefly present what an intrusion detection system is and how decision trees can
be used.

4 Intrusion Detection Systems 

Intrusion in the context of information systems is regarded as a set of attempts 
to compromise a computer network resource security. There are two general ap- 
proaches to intrusion detection (for an introductory work on intrusion detection 
systems, see [1]): 

- Anomaly Detection: based on the detection of an anomaly in a user's behaviour.
The idea is that each user has a certain profile within the system that does not
change much over time. This profile is taken as ‘normal’, and
consequently any significant deviation from it is considered as an anomaly.

- Misuse Detection: also named signature detection, since in this case any
intrusion can be described by its signature, characterized by the values of its features.

Detecting intrusions is generally done using the audit data generated
by the operating system. This paper is more oriented to the anomaly detection
approach.

The data used here are those proposed in KDD’99 for intrusion detection
[9], which are generally used for benchmarking intrusion detection problems.
The data were collected as raw TCP/IP dumps from a host located
on a simulated military network. Each TCP/IP connection is described by 41
discrete and continuous features and labeled as either normal, or as an attack,
with exactly one specific attack type. Attacks fall into four main categories:

- Denial of Service Attacks (DOS), in which an attacker overwhelms the victim
host with a huge number of requests. Such attacks are easy to perform and
can cause a shutdown of the host or a significant slowdown in its performance
(e.g. Pod, Teardrop, Neptune, Smurf).

- Probing, in which an attacker attempts to gather useful information about
machines and services available on the network in order to look for exploits
(e.g. Portsweep, Saint, Satan, Mscan).

- User to Root Attacks (U2R), in which an attacker or a hacker tries to get
the access rights from a normal host in order, for instance, to gain root
access to the system (e.g. Perl, Xterm, Ps, Rootkit).

- Remote to User Attacks (R2L), in which the intruder tries to exploit the
system vulnerabilities in order to control the remote machine through the
network as a local user (e.g. Ftp_write, Imap, Guess_passwd, Phf).

The 41 features characterizing each connection are divided into four groups:
basic features of individual TCP connections; content features within a connection,
suggested by domain knowledge; time-based traffic features, computed using a
two-second time window; and host-based traffic features, computed using a window
of 100 connections, which are used to characterize attacks that scan the hosts (or
ports) using a much larger time interval than two seconds. Some of these features
are listed in Table 1.

5 Decision Trees for Intrusion Detection Systems 

This section presents results on intrusion detection using standard decision trees 
of Quinlan [13]. Several experimentations will be performed on KDD’99 dataset. 



5.1 Different Case Studies 

We handle 10% of this dataset, corresponding to 494019 training connections and
311029 testing connections. There exist several recent works on using decision
trees on the KDD’99 dataset [10]. However, the experimental results presented here give
a new perspective; in particular, different levels of classification results are
considered. Indeed, we perform several experimentations according to three levels
of attack granularity:
Table 1. List of features

Feature  Name                          Description

Basic features of individual TCP connections
A1       duration                      length (number of seconds) of the connection
A2       protocol_type                 type of the protocol, e.g. tcp, udp, etc.
A3       service                       network service on the destination, e.g., http, telnet, etc.
A4       flag                          normal or error status of the connection
A5       src_bytes                     number of data bytes from source to destination
A6       dst_bytes                     number of data bytes from destination to source
A7       land                          1 if connection is from/to the same host/port; 0 otherwise
A8       wrong_fragment                number of “wrong” fragments

Content features within a connection suggested by domain knowledge
A9       hot                           number of “hot” indicators
A10      num_failed_logins             number of failed login attempts
A11      logged_in                     1 if successfully logged in; 0 otherwise
A12      num_compromised               number of “compromised” conditions
A13      num_file_creations            number of file creation operations
A14      num_shells                    number of shell prompts

Time based traffic features computed using a two-second time window
A15      count                         number of connections to the same host
A16      same_srv_rate                 % of connections to the same host using the same service

Host-based traffic features computed using a window of 100 connections
A17      dst_host_count                number of connections to the same host
A18      dst_host_srv_count            number of connections to the same host using the same service
A19      dst_host_diff_srv_rate        % of connections to the same host using different services
A20      dst_host_same_src_port_rate   % of connections to the same host having the same src port
A21      dst_host_srv_diff_host_rate   % of connections to the same host and same service using different hosts
A22      dst_host_srv_serror_rate      % of connections to the same host and same service having “SYN” errors
A23      dst_host_rerror_rate          % of connections to the same host having “REJ” errors



- Whole-Attacks: all attack classes present in the KDD dataset in addition to the
normal situation.

- Five-Classes: the four attack categories (i.e. DOS, R2L, U2R, Probing) in addition
to the normal class. Note that there are 19.69% (resp. 79.24%, 0.23%, 0.01%, 0.83%)
of normal (resp. DOS, R2L, U2R, Probing) training connections and 19.48% (resp.
73.90%, 5.21%, 0.07%, 1.34%) of normal (resp. DOS, R2L, U2R, Probing) testing
connections.

- Two-Classes: i.e. Normal and Abnormal, by grouping all attacks in the same
class (i.e. Abnormal).

In the five-class and two-class cases, there are two strategies to gather results:
either before or after classification.

The evaluation of classification efficiency is based on the Percent of Correct
Classification (PCC) of the instances belonging to the testing set:

$$PCC = \frac{\text{number of well classified instances}}{\text{number of classified instances}}$$
5.2 Experimental Results

Table 2 gives the PCC of the testing set according to the three levels of attack
granularity where, in the five-class and two-class cases, gathering is done before
classification. More precisely, in the five-class case we slightly modify the dataset
by grouping attacks belonging to the same attack category (i.e. DOS, R2L,
U2R or Probing), and in the two-class case we group them in a unique class, i.e.
abnormal.

In the three experimentations the PCC is almost the same (between 91.73% and
92.42%). This means that dealing with all attacks, with attack categories, or with
only one class, namely the abnormal one, does not noticeably affect the classification
quality of the decision tree technique.



Table 2. PCCs of the testing set

Two-classes    Five-classes    Whole-attacks
92.42%         92.21%          91.73%



In order to analyze misclassified connections, we have studied the confusion
matrices, focusing on normal versus abnormal connections. Namely, in
the whole-attack and five-class cases, we gather the results regarding attacks in a
unique abnormal class after the classification procedure¹. The results are
summarized in Table 3.



Table 3. Confusion matrix relative to the normal and abnormal classes (values between
parentheses are relative to gathering whole-attacks and five-classes results into two
classes after classification)

CLASSIFIED AS →      Normal               Abnormal
Normal (60593)       98.24%               1.76%
                     (99.42%, 98.28%)     (0.58%, 1.72%)
Abnormal (250436)    8.98%                91.02%
                     (9.02%, 9.00%)       (90.98%, 91.00%)
PCC                  92.42%
                     (92.62%, 92.42%)



This table shows that normal connections are usually very well classified. It
also shows that, for classifying normal instances, it is better to gather the different
attacks into the abnormal class after classification using the initial dataset (i.e.
containing the whole attacks) rather than before learning (two-class or five-class
cases). This behaviour can be explained by the fact that each leaf in the decision
tree is labeled by the most probable class. So, the larger the number of attack
classes, the greater the chance of having leaves labeled with the normal class, and
consequently most normal testing connections will be well classified.

¹ The confusion matrices also provide the recall criterion, which can be read from the
diagonal of the matrix.

Table 4. Confusion matrices relative to five classes (values between parentheses are
relative to gathering whole-attacks results into five classes after classification)

CLASSIFIED AS →    Normal             DOS                R2L              U2R              Probing
Normal (60593)     98.28% (99.42%)    1.33% (0.15%)      0.02% (0.03%)    0.01% (0.01%)    0.36% (0.40%)
DOS (229853)       2.72% (2.67%)      97.28% (96.87%)    0.00% (0.00%)    0.00% (0.00%)    0.00% (0.46%)
R2L (16189)        93.79% (96.80%)    0.01% (0.00%)      3.40% (3.04%)    0.67% (0.03%)    2.13% (0.13%)
U2R (228)          91.67% (58.33%)    0.44% (0.44%)      2.63% (7.46%)    3.51% (5.26%)    1.75% (28.51%)
Probing (4166)     21.22% (15.70%)    3.77% (5.42%)      0.24% (0.00%)    0.00% (0.00%)    74.77% (78.88%)
PCC                92.21% (92.18%)
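
As a consistency check, the global PCC can be recomputed from the per-class recalls and class sizes reported in Table 3. The helper below is a hypothetical sketch; since the published recalls are rounded to two decimals, the recomputed figures only match the table up to rounding.

```python
def pcc(recalls_and_counts):
    """Percent of correct classification from (recall in %, class size) pairs."""
    total = sum(count for _, count in recalls_and_counts)
    correct = sum(recall / 100.0 * count for recall, count in recalls_and_counts)
    return 100.0 * correct / total


# (recall %, number of testing connections) for the Normal and Abnormal classes.
two_classes = [(98.24, 60593), (91.02, 250436)]            # gathering before learning
whole_attacks_gathered = [(99.42, 60593), (90.98, 250436)]  # gathering after classification
five_classes_gathered = [(98.28, 60593), (91.00, 250436)]   # gathering after classification

for name, data in [("two-classes", two_classes),
                   ("whole-attacks gathered", whole_attacks_gathered),
                   ("five-classes gathered", five_classes_gathered)]:
    print(f"{name}: {pcc(data):.2f}%")
```

The computation also shows why the global PCC hides the behaviour on rare classes: the abnormal class is about four times larger than the normal one and therefore dominates the weighted average.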

Contrary to normal connections, abnormal ones are not always well classified,
as confirmed by the PCC of abnormal instances given in Table 3 (even if the global
PCC is high). Indeed, a thorough analysis of misclassified abnormal connections
(see Table 4) shows that the U2R and R2L attacks in the testing set are almost always
misclassified (only 3.51% (resp. 3.40%) of U2R (resp. R2L) connections are well
classified when we gather attacks before classification). This behaviour does not
reflect the optimistic results obtained on the training data, where 82.69% (resp.
98.93%) of U2R (resp. R2L) connections are well classified. This is due to the
fact that the proportions of U2R and R2L attacks in the training set are very
low (0.01% for U2R and 0.23% for R2L).

In fact, within decision trees, when a class is represented by a low number
of training instances, the class is weakly learned and testing connections really
belonging to it tend to be misclassified.
Hence, we can have new testing instances really belonging to U2R and R2L
attacks, but characterized by attribute values which deviate from those
characterizing these two classes in the training set. These instances were not
learned in the construction phase, and the class assigned to them by the
induced tree is generally wrong. To illustrate this, let us analyze the rule base
relative to U2R induced from the decision tree:

- R1: if A15 ≤ 64, A22 ≤ 0.03, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 ≤ 0,
A20 ≤ 0.99, A16 > 0.32, A4 = SF, A5 > 6, A14 ≤ 0, A6 > 6, A5 ≤ 36530,
A13 ≤ 0, A18 ≤ 4, A3 = telnet, A6 ≤ 1342, A1 > 20 then U2R

- R2: if A15 ≤ 64, A22 ≤ 0.03, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 ≤ 0,
A20 ≤ 0.99, A16 > 0.32, A4 = SF, A5 > 6, A14 ≤ 0, A6 > 6, A5 ≤ 36530,
A13 > 0, A18 ≤ 3 then U2R

- R3: if A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 ≤ 0,
A20 ≤ 0.99, A16 > 0.32, A4 = SF, A5 > 6, A14 > 0, A3 = telnet then U2R

- R4: if A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 ≤ 0,
A20 > 0.99, A5 ≤ 19, A17 ≤ 1, A2 = tcp, A7 = 0, A3 = ftp_data, A4 = SF then
U2R

- R5: if A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 ≤ 0,
A20 > 0.99, A5 ≤ 19, A17 > 1, A6 > 1, A3 = ftp_data, A11 = 1 then U2R

- R6: if A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 > 0,
A4 = SF, A5 ≤ 132, A21 > 0.01 then U2R

- R7: if A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 > 0,
A4 = SF, A5 ≤ 132, A1 ≤ 140, A18 ≤ 2, A13 > 2 then U2R

- R8: if A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0, A8 ≤ 0, A21 ≤ 0.48, A6 > 6, A23 ≤ 0.45,
A10 ≤ 0, A5 ≤ 1 then U2R

- R9: if A15 ≤ 64, A22 ≤ 0.82, A12 > 0, A5 ≤ 10073, A6 > 0.09, A12 ≤ 13,
A3 = telnet, A6 ≤ 6223 then U2R

- R10: if A15 > 64, A6 > 1, A18 ≤ 69, A13 > 0 then U2R

When analyzing the testing subset relative to U2R attacks, we first note
that for the attribute A15, which is the root of the tree, all the corresponding
values are less than or equal to 64, which excludes the use of R10. Continuing
the analysis of the U2R testing connections, we remark that the majority should
follow rules R1 to R5 (187 of them satisfy A15 ≤ 64, A22 ≤ 0.82, A12 ≤ 0,
A8 ≤ 0, A21 ≤ 0.48, A19 ≤ 0.91, A9 ≤ 0). However, in all of these rules A4
should be equal to SF, which is not the case in the majority of testing connections;
these connections are therefore misclassified, since they do not satisfy any of the
rules R1 to R10. This can be explained by the fact that in the learning phase A4
appears with the value SF, while in the majority of testing connections it takes
the value REJ, which never appears in the learning set with U2R attacks.



6 Qualitative Decision Trees for Intrusion Detection 
Systems 

The experimentations performed above suppose that the connections to
classify are known with certainty, which is not always the case in real TCP/IP
traffic. Indeed, the testing set used corresponds to offline traffic, and a more
interesting task is to classify connections online. This supposes that at an
instant t we should be able to classify a connection even if some of its
characteristic attributes are only partially known. Of course, the partially specified
attributes are not always the same, and generally differ from one connection to
another.

In such a case the use of qualitative decision trees presented in Sect. 3 seems
to be appropriate, since it allows the classification of uncertain connections. The
proposed approach can classify not only connections having some missing
values but also those characterized by possibility distributions on their values.

Let us illustrate this with the decision tree (representing four classes, namely
N, S, P and T) given in Fig. 1.







count (A)
  ≤ 46 (a1): same_srv_rate (B)
      ≤ 32 (b1): wrong_fragment (C)
          ≤ 0 (c1): Normal (N)
          > 0 (c2): Teardrop (T)
      > 32 (b2): Pod (P)
  > 46 (a2): protocol_type (D)
      tcp (d1): Smurf (S)
      udp (d2): Normal (N)
      icmp (d3): Smurf (S)

Fig. 1. A decision tree for intrusion detection



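1. $(a_1 \land b_1 \land c_1 \Rightarrow N, 1)$ resolved with $(a_1, 1)$ gives $(b_1 \land c_1 \Rightarrow N, 1)$
2. $(b_1 \land c_1 \Rightarrow N, 1)$ resolved with $(c_1, 1)$ gives $(b_1 \Rightarrow N, 1)$
3. $(b_1 \Rightarrow N, 1)$ resolved with $(b_1, Ne(\alpha_2))$ gives $(N, Ne(\alpha_2))$
4. $(N, Ne(\alpha_2))$ resolved with $(\neg N, 1)$ gives $(\bot, Ne(\alpha_2))$

Fig. 2. Refutation proof for $(N, Ne(\alpha_2))$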



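Assume that the instance to classify is $\pi = (\pi_A, \pi_B, \pi_C, \pi_D)$ with:

$\pi_A(a_1) = 1$, $\pi_A(a_2) = 0$; $\pi_B(b_1) = 1$, $\pi_B(b_2) = \alpha_2$; $\pi_C(c_1) = 1$, $\pi_C(c_2) = 0$;
$\pi_D(d_1) = 1$, $\pi_D(d_2) = \alpha_2$, $\pi_D(d_3) = 0$

(these are the values induced, following the definition of Sect. 3.1, by the knowledge
bases given next).

Namely, A and C are precisely described, while there is some uncertainty on
B and D. The knowledge bases associated to $\pi_A$, $\pi_B$, $\pi_C$, $\pi_D$ and to the DT are:
$\Sigma_A = \{(a_1, 1)\}$, $\Sigma_B = \{(b_1, Ne(\alpha_2))\}$, $\Sigma_C = \{(c_1, 1)\}$,
$\Sigma_D = \{(d_1 \lor d_2, 1), (d_1, Ne(\alpha_2))\}$,
$\Sigma = \{(a_1 \land b_1 \land c_1 \Rightarrow N, 1), (a_1 \land b_1 \land c_2 \Rightarrow T, 1), (a_1 \land b_2 \Rightarrow P, 1), (a_2 \land (d_1 \lor d_3) \Rightarrow S, 1), (a_2 \land d_2 \Rightarrow N, 1)\}$.

To these knowledge bases, we should add the following domain exclusion
constraints: $(d_1 \lor d_2 \lor d_3, 1)$, $(\neg d_1 \lor \neg d_2, 1)$, $(\neg d_1 \lor \neg d_3, 1)$, $(\neg d_2 \lor \neg d_3, 1)$.

From these knowledge bases, we can check that
$\Sigma \cup \Sigma_A \cup \Sigma_B \cup \Sigma_C \cup \Sigma_D \vdash_\pi (N, Ne(\alpha_2))$,
where $\vdash_\pi$ is the possibilistic logic inference. Indeed, let us add
$(\neg N, 1)$ to the knowledge base, namely let us assume that the instance does
not belong to the normal class. The refutation of Fig. 2 shows that there is a
contradiction and hence provides the certainty degree of N, which is $Ne(\alpha_2)$.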

None of the other classes, i.e. (S, T, P), can be inferred with a weight
greater than $Ne(\alpha_2)$. Hence, the instance will be classified as a normal connection.
In this example, there was no problem in classifying the instance $\pi$. Assume now that
we have the following instance $\pi = (\pi_A, \pi_B, \pi_C, \pi_D)$ to classify, with:

$\pi_A(a_1) = 0$, $\pi_A(a_2) = 1$; $\pi_B(b_1) = 1$, $\pi_B(b_2) = 0$; $\pi_C(c_1) = 1$, $\pi_C(c_2) = 0$;
$\pi_D(d_1) = 1$, $\pi_D(d_2) = 1$, $\pi_D(d_3) = 0$

(again, these are the values induced by the knowledge bases given next).

In this instance, all variables are precisely defined, except D, where the only
available information is that the protocol type is not icmp.

The knowledge bases associated to $\pi_A$, $\pi_B$, $\pi_C$, $\pi_D$ and to the decision tree are:
$\Sigma_A = \{(\neg a_1, 1)\}$, $\Sigma_B = \{(b_1, 1)\}$, $\Sigma_C = \{(c_1, 1)\}$, $\Sigma_D = \{(d_1 \lor d_2, 1)\}$,
$\Sigma = \{(a_1 \land b_1 \land c_1 \Rightarrow N, 1), (a_1 \land b_1 \land c_2 \Rightarrow T, 1), (a_1 \land b_2 \Rightarrow P, 1), (a_2 \land (d_1 \lor d_3) \Rightarrow S, 1), (a_2 \land d_2 \Rightarrow N, 1)\}$.

Of course, we also add the domain exclusion constraints as above. From
these knowledge bases, we can check that no single class can be inferred with a
positive certainty degree. The only thing that can be inferred is $(S \lor N, 1)$, which
expresses the fact that the instance $\pi$ certainly belongs to either the Normal class or
the Smurf class. This result is satisfactory. Possibilistic logic refuses to make an
arbitrary choice between these two classes, which is not the case when using standard
decision trees. Another advantage of using possibilistic logic is that it is incremental,
in the sense that the knowledge base can be easily updated upon the arrival of
new information. This is particularly useful since the values of several attributes
become more and more precise with time, and are only completely defined
at the end of connections.
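
The two worked examples can be reproduced at the semantic level with a short script. Everything numeric here is an assumption made for illustration: the scale is taken in [0, 1] with $Ne(x) = 1 - x$, `ALPHA2 = 0.25` stands in for $\alpha_2$, and the instance distributions are the ones induced by the knowledge bases above. The sketch also assumes normalized distributions and mutually exclusive classes, so that the certainty of a set of classes is $Ne$ of the highest possibility among the remaining classes.

```python
ALPHA2 = 0.25  # hypothetical numeric stand-in for alpha_2


def ne(degree):
    """Order-reversing map on the scale; 1 - x is one possible numeric choice."""
    return 1.0 - degree


# Paths of the tree of Fig. 1: (allowed values per tested attribute, class).
PATHS = [
    ({"A": {"a1"}, "B": {"b1"}, "C": {"c1"}}, "N"),
    ({"A": {"a1"}, "B": {"b1"}, "C": {"c2"}}, "T"),
    ({"A": {"a1"}, "B": {"b2"}}, "P"),
    ({"A": {"a2"}, "D": {"d1", "d3"}}, "S"),
    ({"A": {"a2"}, "D": {"d2"}}, "N"),
]


def class_possibility(instance):
    """Highest possibility degree of each class under min-combination."""
    degrees = {}
    for conditions, label in PATHS:
        degree = min(
            max(instance[attr][value] for value in allowed)
            for attr, allowed in conditions.items()
        )
        degrees[label] = max(degrees.get(label, 0.0), degree)
    return degrees


def certainty(instance, classes):
    """Certainty that the instance belongs to one of `classes`."""
    degrees = class_possibility(instance)
    others = [d for label, d in degrees.items() if label not in classes]
    return ne(max(others)) if others else 1.0


first = {
    "A": {"a1": 1.0, "a2": 0.0},
    "B": {"b1": 1.0, "b2": ALPHA2},
    "C": {"c1": 1.0, "c2": 0.0},
    "D": {"d1": 1.0, "d2": ALPHA2, "d3": 0.0},
}
second = {
    "A": {"a1": 0.0, "a2": 1.0},
    "B": {"b1": 1.0, "b2": 0.0},
    "C": {"c1": 1.0, "c2": 0.0},
    "D": {"d1": 1.0, "d2": 1.0, "d3": 0.0},
}

print(certainty(first, {"N"}))        # Ne(alpha_2), i.e. 0.75 here
print(certainty(second, {"N"}))       # 0.0
print(certainty(second, {"S", "N"}))  # 1.0
```

With these stand-in values the first instance is Normal with certainty $Ne(\alpha_2)$, while for the second instance only the disjunction of Smurf and Normal is certain, in line with the syntactic results above.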

7 Conclusion 

There are two main contributions in this paper. The first one is an extension
of decision trees to classify instances with uncertain or missing attributes, in
a possibilistic logic framework. This extension avoids arbitrary classifications
of instances in partial ignorance situations. Handling uncertain observations is
very important in intrusion detection systems. Indeed, a complete description
of all attributes is only possible when connections end. This can be too late
to react, and hence it is highly recommended to provide a plausible classification
from incomplete data. The second contribution concerns the use of decision trees
in intrusion detection systems. The different experimental results presented in
this paper are very encouraging. This is particularly true when only focusing on
normal/abnormal connections. Indeed, the best strategy yields up to
99.42% of normal connections correctly classified, and up to 91.02% of abnormal
connections correctly classified. Future work will be to test qualitative
decision trees experimentally on the intrusion detection problem and to make
comparisons with other algorithms.
Abstract. The concept of conditional event dealt with here is that given
by Coletti and Scozzafava in a series of papers: a detailed account of
the relevant theory is in their book “Probabilistic Logic in a Coherent
Setting”, Kluwer (2002). In this paper, our aim is to show that, relying
on this definition, many of the (putative) inconsistencies and flaws
concerning this concept disappear. In particular, the well-known Lewis’
triviality principles can be looked on, in this framework, under a different
perspective, also due to the circumstance that the concept of “indicative
conditionals” (as put by Lewis, and also by Adams) is a very particular
case of this general concept of conditional event. A crucial role is played,
in our approach, by conditional events of probability 0 and 1.



1 Introduction 

Our aim is to show that, relying on the concept of conditional event as introduced 
by Coletti and Scozzafava in [2] , many of the (putative) inconsistencies and flaws 
concerning this concept disappear. 

In particular, the well-known Lewis’ triviality principles (discussed in [12] 
and partly reshuffled in [13]) can be looked on, in our framework, under a dif- 
ferent perspective, also due to the circumstance that the concept of “indicative 
conditionals” (as put by Lewis, and also by Adams [1]) is a very particular case 
of the concept of conditional event given in [2] . 

This concept plays a central role also for probabilistic reasoning. It
generalizes (or better, in a sense, it gives up) the idea of de Finetti [6] of looking at
a conditional event $E|H$, with $H \neq \emptyset$ (the impossible event), as a three-valued
logical entity (true when both $E$ and $H$ are true, false when $H$ is true and $E$ is
false, “undetermined” when $H$ is false) by letting the third value suitably depend
on the given ordered pair $(E, H)$ and not being just an undetermined common
value for all pairs.

It turns out that this function can be seen as a measure of the degree of belief
in the conditional event $E|H$, which under “natural” conditions reduces to the
conditional probability $P(E|H)$, in its most general sense related to the concept
of coherence, and satisfying the classic axioms as given by de Finetti in [7].
We will refer to a suitable multi-valued logic (for an expository paper on
this subject, see [14]) with partial (and truth-functional) connectives, where the 
truth-values have a “logical meaning” in terms of a betting interpretation. This 
approach is explained in detail in [3], while a formal axiomatic account will be 
given in a forthcoming paper [8] . 

We stress that this approach neither refers to Boolean-like structures, nor 
tries to define logical operations for every pair of conditional events. So it clearly 
differs from (and is much simpler than) other well-known interpretations exist- 
ing in the relevant literature, such as, e.g., those of Dubois and Prade [9] and 
Goodman and Nguyen [10] (by the way, also these authors examine critically 
Lewis’ results). 

As a final comment, let us mention that, according to an anonymous referee,
“very similar solutions to Lewis’ triviality results” appeared in “Reasoning over
Impossible Worlds” by A. Lopez-Ortiz, Journal of Computing and Information,
Vol 1, 1995 (and also in International Conference on Computing and
Informatics, ICCI, 1994). We checked this issue (and all the others) of that Journal (and
also the list of papers of the 1994 ICCI Conference), and we were not able to find
the aforementioned paper. Nevertheless, we have found a working paper with the
same title on the web site of the author: its content has nothing to do with our
approach, in which conditioning with respect to $H = \emptyset$ (the impossible “world”)
makes no sense. We allow instead (as repeatedly explained in the paper)
conditioning with respect to events that are not impossible but have probability zero
(see Sects. 4 and 5), which is a quite different matter. Last (but not least), in that
paper Lewis’ triviality principle is not challenged!



2 Events 

An event can be singled out by a (nonambiguous) statement $E$, that is a
(Boolean) proposition that can be either true or false (corresponding to the
two “values” 1 or 0 of the indicator $I_E$ of $E$).

Two particular cases are the certain event $\Omega$ (that is always true) and the
impossible event $\emptyset$ (that is always false): notice that $\Omega$ is the contrary of $\emptyset$, and
vice versa. Notice that only in these two particular cases the relevant
propositions correspond to an assertion.

To make an assertion, we need to say something extra-logical or concerning
the existence of some logical relation, such as “You know that $E$ is false” (so
that $E = \emptyset$).

The “logic of certainty” deals with TRUE and FALSE as final, and not 
asserted, possible answers, while with respect to a given state of information 
there exist, as alternatives concerning an event, besides that of being possible, 
also those of being certain or impossible. 

Probability is a function (defined on a Boolean algebra of events) which
satisfies the usual and classic properties, i.e.: its range is between zero and
one (these two extreme values being assumed by - but not kept only for - the
impossible and the certain event, respectively), and it is additive for mutually
exclusive events.

Notice that (as is well known) an impossible event (that is, an event known
to be false) has probability 0, but NOT conversely (so that we may have many
possible events whose probability is zero). Analogous remarks can be stated for
a certain event (that is, an event known to be true), which may not be the
only one having probability 1.



3 Conditional Events 

Are conditional events truth valued? If “truth valued” means two-valued (i.e.,
Boolean), the answer (in agreement with Lewis) is certainly “NO”. But if we refer
to a “multi-valued” logic, conditional events can be seen as truth valued, in
the following sense (in terms of a betting scheme), which extends the well-known
interpretation of the probability $P(E)$ of an event E as the amount paid to bet
on E, with the proviso of receiving an amount 1 if E is true (the bet is won) or
0 if E is false (the bet is lost), so that

“the indicator $I_E$ is just the amount got back by paying $P(E)$ in a bet on E”.

Consider now a conditional bet on $E|H$: an amount $t(E,H)$ is paid to bet
on $E|H$ getting, when H is true, either an amount 1 if also E is true (the bet is
won) or an amount 0 if E is false (the bet is lost), and getting back the amount
$t(E,H)$ if H turns out to be false (the bet is called off).

In short, the truth-value $T(E|H)$ of a conditional event $E|H$ (which reduces,
for an (unconditional) event E, to the indicator $I_E = T(E|\Omega)$) is introduced in
the following way:

“the value of $T(E|H)$ is just the amount got back
by paying $t(E,H)$ in a bet on $E|H$”,

so that it can be written (assuming, for any given H, that $t(\cdot, H)$ is not identically
equal to zero),



$$T(E|H) = 1 \cdot I_{E \wedge H} + 0 \cdot I_{E^c \wedge H} + t(E,H) \cdot I_{H^c}.$$

In other words, a conditional event $E|H$ (or, better, its truth-value) can be seen
as a discrete random quantity

$$T(E|H) = Y = \sum_{k=1}^{\nu} y_k I_{E_k},$$

taking $\nu = 3$, $E_1 = E \wedge H$, $E_2 = E^c \wedge H$, $E_3 = H^c$ and $y_1 = 1$, $y_2 = 0$,
$y_3 = t(E,H)$.
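
The three-valued truth value can be sketched in a few lines of Python. This is only an illustration of the betting scheme described above: the function name `truth_value` and the numerical stake `t` are assumptions made for the example, not part of the original text.

```python
from fractions import Fraction


def truth_value(e_true: bool, h_true: bool, t: Fraction) -> Fraction:
    """Amount got back from a conditional bet on E|H after paying t = t(E, H)."""
    if h_true and e_true:
        return Fraction(1)  # E and H both true: the bet is won
    if h_true:
        return Fraction(0)  # H true, E false: the bet is lost
    return t                # H false: the bet is called off, the stake is returned


t = Fraction(2, 5)  # hypothetical value of t(E, H), for illustration only
table = {(e, h): truth_value(e, h, t)
         for e in (True, False) for h in (True, False)}
```

When H is the certain event the third branch is never reached, and the function reduces to the indicator of E.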

The (ordered) pair (E, H) is called the Boolean support of the conditional
event $E|H$.

More details are in the book [3], Chapter 10, where it is also shown that, if we
have an arbitrary family $\mathcal{C}$ of conditional events $E|H$ and consider the relevant set
$\mathcal{T}$ of random quantities $T(E|H)$, this can be endowed with two partial operations
(sum and product) in such a way that, by requiring the closure of these operations
inside this particular class $\mathcal{T}$, it turns out that the map $t(E, H)$ satisfies the
classic de Finetti-Popper axioms of a conditional probability:

(i) $t(H, H) = 1$,

(ii) $t((E \vee A), H) = t(E, H) + t(A, H)$, for $E \wedge A \wedge H = \emptyset$,

(iii) $t((E \wedge A), H) = t(E, H) \cdot t(A, (E \wedge H))$.

So the function $t(E,H)$ can be denoted also by the usual symbol $P(E|H)$.

The multi-valued logic associated with a given family $\mathcal{C} = \{E_i|H_i\}_{i \in I}$ of
conditional events reduces to a two-valued one when $H_i = \Omega$ for all $i \in I$
(i.e. when all conditional events reduce to (Boolean) events).

It is important to stress that the latter condition is sufficient, but not necessary,
to have a family of two-valued conditional events (see the next section).



4 Conditional Probabilities Equal to 0 or 1 



Consider a family $\mathcal{C}$ such that, for any $E_i|H_i \in \mathcal{C}$, either $t(E_i,H_i) = 1$ or
$t(E_i,H_i) = 0$. In the first case a conditional event $E|H$ can be seen as the
event $H^c \vee E$, and in the second case as the event $E \wedge H$ (as follows easily
by checking the truth values of $E|H$ and of the aforementioned corresponding
(Boolean) events).

In particular, if we assert $H^c \vee E = \Omega$, then $H \subseteq E$ (H logically implies
E), so we certainly have $t(E, H) = 1$; if we assert $E \wedge H = \emptyset$, then we certainly
have $t(E, H) = 0$ (in these two cases the conditional event $E|H$ is actually
even one-valued).

Notice that for situations which are different, respectively, from the trivial
ones $E \wedge H = \emptyset$ and $H \subseteq E$, the extreme values 0 and 1 admit suitable
interpretations (for the connections with the concept of coherence, see the book [3]). In
particular, a deeper study of the case $P(E|H) = 1$ leads to a “natural” treatment
of default reasoning: see [4], [5].

Moreover, $P(E) = 1$ does not imply $P(E|H) = 1$, as it does in the usual framework,
where $P(E) = 1$ and (the necessary assumption) $P(H) > 0$ imply $E \wedge H \neq \emptyset$. We
can take instead $P(H) = 0$ (the conditioning event H, which must be a possible
one, may in fact have zero probability, since in the assignment of $P(E|H)$ we are
driven only by the relevant axioms); then we may have logical relations between
H and E (in particular, $E \wedge H = \emptyset$, and so $P(E|H) = 0$).

On the other hand, it is clearly possible (as can be seen by a simple application
of Theorem 4 of [3], which characterizes coherent assignments of conditional
probability) to assess coherently $P(E|H) = p$ for every value $p \in [0, 1]$. In
conclusion, a probability equal to 1 can be, in our framework, updated (so
that the same is true, obviously, for a probability equal to 0).



5 Lewis’ Triviality Principles

The argument of Lewis to establish his triviality results is essentially based
on proving that, if we apply to a conditional event $E|H$ the following classic
probability formula

$$P(A) = P(E)P(A|E) + P(E^c)P(A|E^c)$$

(taking $A = E|H$ and assuming that both $P(H \wedge E)$ and $P(H \wedge E^c)$ are positive),
we get $P(E|H) = P(E)$ for any such pair of events (i.e., stochastic independence
of H and E), since, denoting $P(A|B) = P_B(A)$, closure under conditionalisation
requires $P_H(A|B) = P(A|B \wedge H)$.
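
Spelled out, the computation behind this claim runs as follows. This is a reconstruction of the standard argument that uses only the formula and the closure requirement stated above, together with the usual ratio definition of conditional probability; the intermediate steps are not given in the text. Taking $A = E|H$ and writing $P(A|E) = P_E(E|H)$, $P(A|E^c) = P_{E^c}(E|H)$:

$$\begin{aligned}
P(E|H) &= P(E)\,P_E(E|H) + P(E^c)\,P_{E^c}(E|H) \\
       &= P(E)\,P(E|H \wedge E) + P(E^c)\,P(E|H \wedge E^c) \\
       &= P(E) \cdot 1 + P(E^c) \cdot 0 = P(E).
\end{aligned}$$

The last line uses $P(E|H \wedge E) = 1$ and $P(E|H \wedge E^c) = 0$, which hold under the ratio definition precisely because both conditioning events are assumed to have positive probability.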

Of course, these positivity assumptions are required by the fact that Lewis
assumes the usual definition of conditional probability, in the sense that $E|H$
is looked on as an object such that, for any probability function P and for any
events H and E, with $P(H) > 0$,

$$P(E|H) = \frac{P(E \wedge H)}{P(H)}. \qquad (1)$$



Instead in our framework, due to the direct assignment of $P(E|H)$ as a whole,
the knowledge (or the assessment) of the “joint” and “marginal” unconditional
probabilities $P(E \wedge H)$ and $P(H)$ is not required (so that, as already observed
at the end of Sect. 4, we may have $P(H) = 0$).

Then Lewis’ interpretation of a conditional event (called by him “indicative
conditional”) is (in agreement with Adams) a very restrictive one. The (putative)
failure of $E|H$ to be truth-valued (in fact they mean non-Boolean, since
multi-valued logics are not taken into consideration) leads Lewis and
Adams to deal with a conditional event only through its conditional probability
(in the classic sense) $P(E|H)$.

Their starting point is that assertability of an event E is associated with the
requirement that $P(E)$ is “sufficiently close to 1” (since the value 1 is kept only
for the certain event), so resorting to a kind of “probabilistic transform” of
truth assignment (where “highly probable” is substituted throughout for the
word “true”).

For conditional events they adopt a similar interpretation, linked to the
requirement of “high probability”; moreover they claim (as a soundness criterion
for inferences) that it should not be possible for a premise H to be probable
while the conclusion E is improbable.

The first requirement limits the consideration of conditional events only to
situations corresponding to the so-called default logic, but with the second
requirement what is considered is in fact a very particular case of this theory.

In this (very restrictive) framework, a trivial language is (using our notation
and terminology) a family of events containing only $\Omega$ and $\emptyset$, and Lewis shows
(first triviality result) that any language (i.e., family of events) equipped with a
conditional event, i.e., an object satisfying (1), is a trivial language.

The second triviality result (in Lewis’s framework, but still using our
terminology) is the following: if a class of probability assignments (on a given family
of events) is closed under conditionalizing, then the family cannot be equipped
with conditional events, unless the class consists entirely of trivial probability
functions, i.e. those never assigning positive probability to more than two
incompatible events.



6 A Counterexample 

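Given events $E_1, H_1, E_2, H_2$, with $E_2 \wedge H_1 = \emptyset$, $E_1 \wedge H_2 = \emptyset$ and

$P(E_1 \wedge H_1) = P(E_2 \wedge H_2)$ = [value illegible],

$P(E_1 \wedge H_1^c \wedge E_2^c) = P(E_2 \wedge H_2^c \wedge E_1^c)$ = [value illegible, denominator 6],

$P(E_1 \wedge E_2) = P(E_1^c \wedge E_2^c \wedge H_1^c \wedge H_2^c)$ = [value illegible],

consider the family of conditional events

$$\mathcal{C} = \{E_1|H_1,\ E_2|H_2,\ E_1^c|H_1,\ E_2^c|H_2\}.$$

We have $t(E_1,H_1) = t(E_2,H_2) = 1$, $t(E_1^c,H_1) = t(E_2^c,H_2) = 0$.
Moreover, the five pairwise incompatible events of the family

$$\mathcal{E} = \{E_1 \wedge H_1,\ E_2 \wedge H_2,\ E_1 \wedge H_1^c \wedge E_2^c,\ E_1 \wedge E_2,\ E_2 \wedge H_2^c \wedge E_1^c\}$$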

have positive probability, and

$$t(E_i, H_i) = P(E_i|H_i) = 1 \neq P(E_i) = \tfrac{1}{2},$$
$$t(E_i^c, H_i) = P(E_i^c|H_i) = 0 \neq P(E_i^c) = \tfrac{1}{2}.$$

Notice that we can consider as “closed under conditionalising” (to use Lewis’
terminology) the class made up of $\mathcal{E}$ and the initially given four events $E_1$, $H_1$,
$E_2$, $H_2$, since all possible conditioning events have positive probability.

So in this framework Lewis’ first and second triviality results do not
hold.
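
The mechanics of the counterexample can be checked numerically. The sketch below is only an illustration: it assumes that the six pairwise incompatible events (the five events of the family plus the contrary of their disjunction) carry all the probability mass, and it uses hypothetical equal weights of 1/6, chosen solely so that the weights sum to 1. These are not necessarily the values used in the original example. The helper names `prob` and `cond` are likewise assumptions.

```python
from fractions import Fraction

# Atoms are truth assignments (E1, H1, E2, H2) compatible with the logical
# constraints E2 ∧ H1 = ∅ and E1 ∧ H2 = ∅.
p = Fraction(1, 6)  # hypothetical weight, NOT taken from the original text
atoms = {
    (True,  True,  False, False): p,  # E1 ∧ H1
    (False, False, True,  True):  p,  # E2 ∧ H2
    (True,  False, False, False): p,  # E1 ∧ H1^c ∧ E2^c
    (False, False, True,  False): p,  # E2 ∧ H2^c ∧ E1^c
    (True,  False, True,  False): p,  # E1 ∧ E2
    (False, False, False, False): p,  # E1^c ∧ E2^c ∧ H1^c ∧ H2^c
}


def prob(event):
    """Probability of an event given as a predicate on atoms."""
    return sum((w for atom, w in atoms.items() if event(atom)), Fraction(0))


def cond(event, given):
    """Ratio-definition conditional probability; requires prob(given) > 0."""
    return prob(lambda a: event(a) and given(a)) / prob(given)


def e1(atom):
    return atom[0]


def h1(atom):
    return atom[1]


p_e1_given_h1 = cond(e1, h1)                        # P(E1|H1)
p_e1 = prob(e1)                                     # P(E1)
p_not_e1_given_h1 = cond(lambda a: not e1(a), h1)   # P(E1^c|H1)
```

With these weights the conditional probability of $E_1$ given $H_1$ equals 1 while the unconditional probability of $E_1$ equals 1/2, so the two differ even though every conditioning event involved has positive probability.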

In his second paper [13] Lewis modifies his interpretation of “closed under
conditionalising”, considering only conditioning with respect to a given finite
partition: our example needs only a slight modification, by introducing the
partition whose elements are those of $\mathcal{E}$ plus the contrary of their disjunction, and
then allowing conditioning only with respect to these six events.



7 Further (and Final) Comments

A further specific feature of our approach is that, contrary to what is usually
emphasized in the literature when a conditional probability $P(E|H)$ is taken into
account (i.e., that $P(\cdot|H)$ is a probability for any given H, a very restrictive
and misleading view of conditional probability, corresponding trivially to just
a modification of the so-called “sample space” $\Omega$), the conditioning event H is
instead regarded as a “variable”, i.e. the “status” of H in $E|H$ is not just that
of something representing a given fact, but that of an (uncertain) event (like E)
for which the knowledge of its truth value is not required (so that it is always
looked on as an assumed proposition).

Then, due to this general meaning and interpretation of conditioning, Jeffrey’s
generalized conditioning [11], introduced by Lewis in his second paper [13],
is not needed.

We recall in fact that Jeffrey’s conditioning is based on an argument (that
does not apply to our framework) of this kind (again, our notation and
terminology): the usual updating from $P(E)$ to $P(E|H)$ is limited to H representing
a set of certainties, while in many situations you (the expert) may claim to be
less than certain that H is actually true. Then Jeffrey’s claim is to consider in
turn H as an information (with respect to a suitable partition of $\Omega$) regarding
its probability: for details, see [11], but what should be clear is that in our
framework we can challenge the above argument. For a deeper discussion of assumed
or asserted propositions, see Sect. 18.6 of [3].



Abstract. In the context of Dung’s abstract framework for argumentation,
two main semantics have been considered to assign a defeat status
to arguments: the grounded semantics and the preferred semantics.
While the two semantics agree in most situations, there are cases where
the preferred semantics appears to be more powerful. However, we notice
that the preferred semantics gives rise to counterintuitive results in
some other cases, related to the presence of odd-length cycles in the attack
relation between arguments. To solve these problems, we propose a
new semantics which preserves the desirable properties of the preferred
semantics, while correctly dealing with odd-length cycles. We check the
behavior of the proposed semantics in a number of examples and discuss
its relationships with both grounded and preferred semantics.



1 Introduction 

Argumentation theory is a framework for practical and uncertain reasoning
which has received a great deal of attention in several application areas, such
as the realization of intelligent autonomous agents [1], automated negotiation
in multi-agent systems [2] and defeasible reasoning [3]. In a nutshell, commonsense
reasoning dealing with incomplete and uncertain information is modeled
as the process of constructing and comparing arguments for propositions. The
construction of arguments proceeds, from a given set of premises, by chaining
rules of inference which may represent just provisional reasons for their
conclusions. Due to the uncertainty affecting both premises and rules of inference,
it may well be the case that different arguments support contradictory
conclusions. The core problem therefore consists in computing the defeat status of
the arguments, namely in determining which of them emerge undefeated
from conflict: their conclusions are the most credible ones and are considered
justified, while other arguments, being defeated in the conflict, are rejected.

In order to analyze and compare different kinds of semantics underlying the
defeat status computation, Dung [4] has proposed an abstract framework where
arguments are simply conceived as the elements of a set, whose origin is not
specified, and the interaction between them is represented by a binary relation
of attack: this way, the current set of arguments can be represented by means of
a directed graph, called defeat graph in [1]. Thus, an argumentation semantics
can be introduced in a declarative way by defining what arguments are justified
within a generic defeat graph. As pointed out in [5], this definition can follow two
alternative approaches, namely a unique-status approach or a multiple-status
approach. In the first approach, the defeat status of the arguments is defined
in such a way that there is always exactly one possible way to assign them a
status. This approach is adopted e.g. in the argumentation system introduced
in [1], and is represented by the grounded semantics in Dung’s framework. On
the other hand, in a multiple-status approach several extensions are identified.
Roughly, an extension is a set of arguments which do not conflict with each other
and which attack their attackers. An argument is considered as justified if it
belongs to all extensions. This is the approach adopted e.g. in [6,7,3], and in
Dung’s framework is captured by the preferred semantics.

It has been proved in [4] that the preferred semantics “agrees” with the
grounded semantics on those arguments that the latter considers as definitely
justified or rejected. On the other hand, the preferred semantics appears to
be more powerful than the grounded semantics, in that it is sometimes
able to discriminate some of the arguments that are left undecided by the
grounded semantics [8].

After recalling concepts and definitions about argumentation semantics in
Sect. 2, we point out in Sect. 3 that the preferred semantics deals improperly
with odd-length cycles in the defeat graph and we identify some examples where
this limitation gives rise to counter-intuitive defeat status assignments. To solve
these problems, we propose in Sect. 4 a new semantics which preserves the
desirable properties of the preferred semantics, while correctly dealing with
odd-length cycles. In Sect. 5, the relationships with grounded and preferred semantics
are investigated. Finally, Sect. 6 concludes the paper.



2 The Grounded and Preferred Semantics 

In the abstract framework proposed by Dung [4], the primitive notion is that of 
argumentation framework: 

Definition 1. An argumentation framework is a pair $AF = \langle A, \to \rangle$, where A
is a set of arguments and $\to$ is a binary relation of ‘attack’ between them.

It should be noticed that this definition is generic with respect to the
interpretation of A, which is not specified. In any case, we assume A to be finite, as it
is necessarily the case when considering a real reasoner.

In the following, nodes that attack a given $\alpha \in A$ are called defeaters of $\alpha$,
and form a set denoted as parents($\alpha$). If parents($\alpha$) = $\emptyset$, then $\alpha$ is called an
initial node. Following Pollock [1] we define the grounded semantics inductively
(an alternative fixed-point definition is given and shown to be equivalent in [4]):

Definition 2. Given an argumentation framework $AF = \langle A, \to \rangle$, we define
for all $i \geq 0$ the sets $A_i \subseteq A$ as follows:

$$A_i = \begin{cases} A & \text{if } i = 0 \\ \{\alpha \in A \mid \nexists \beta \in A_{i-1} : \beta \to \alpha\} & \text{if } i > 0 \end{cases}$$

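Definition 3. Given an argumentation framework $AF = \langle A, \to \rangle$, the sets of
undefeated, defeated and provisionally defeated arguments are respectively
defined as follows:

- $U_{\mathcal{G}}(AF) = \{\alpha \in A \mid \exists m : \forall i > m\ \alpha \in A_i\}$

- $D_{\mathcal{G}}(AF) = \{\alpha \in A \mid \exists m : \forall i > m\ \alpha \notin A_i\}$

- $P_{\mathcal{G}}(AF) = \{\alpha \in A \mid \forall m\ \exists i > m : \alpha \in A_i \wedge \exists j > m : \alpha \notin A_j\}$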

The idea is that an undefeated argument should be believed given the current
set of arguments A, a defeated argument should not be believed, while a
provisionally defeated argument is controversial, thus it should not be believed
but it should retain the potential to prevent other arguments from being justified.
This is shown in the following examples.

Example 1. With reference to the argumentation framework $AF_1$ shown in
Fig. 1, it is easy to see that $\alpha$ belongs to $A_i$ for all $i \geq 0$, since it has no
defeaters, therefore $\alpha$ is undefeated. As a consequence, $\forall i \geq 1$, $\beta \notin A_i$, therefore
$\beta$ is defeated. This entails in turn that $\forall i \geq 2$, $\gamma \in A_i$, therefore $\gamma$ is undefeated.

Example 2. With reference to the argumentation framework $AF_2$ of Fig. 1, it is
easy to see that, for all $k \geq 0$, both $\alpha$ and $\beta$ belong to $A_{2k}$ but don’t belong
to $A_{2k+1}$, therefore they are provisionally defeated. This alternation of levels is
inherited by $\gamma$, which turns out to be provisionally defeated as well.
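
The level sets of Definition 2 and the three statuses of Definition 3 can be computed directly. The sketch below is a minimal illustration, not the authors' implementation. It relies on the fact that in a finite framework the levels eventually alternate between two sets, so it is enough to iterate until a level equals the one two steps before. The function name `grounded_status` and the string labels are assumptions. The attack relation used for $AF_2$ (mutual attack between a and b, with b attacking c) is also an assumption, chosen to be consistent with the statuses and extensions the text reports, since the figure itself is not legible here.

```python
def grounded_status(args, attacks):
    """Return (undefeated, defeated, provisionally defeated) per Definition 3."""
    args = frozenset(args)

    def next_level(level):
        # A_i keeps exactly the arguments with no attacker in A_{i-1}.
        return frozenset(a for a in args
                         if not any((b, a) in attacks for b in level))

    prev, cur = args, next_level(args)      # A_0 and A_1
    while next_level(cur) != prev:          # stop once A_{i+1} == A_{i-1}
        prev, cur = cur, next_level(cur)
    # From here on the levels alternate between prev and cur.
    undefeated = prev & cur                 # in every later level
    defeated = args - (prev | cur)          # in no later level
    provisional = args - undefeated - defeated
    return undefeated, defeated, provisional


# AF1: the chain a -> b -> c described in Example 1.
af1 = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
# AF2 (assumed): a and b attack each other, b attacks c.
af2 = ({"a", "b", "c"}, {("a", "b"), ("b", "a"), ("b", "c")})

status_af1 = grounded_status(*af1)
status_af2 = grounded_status(*af2)
```

For the chain, a and c come out undefeated and b defeated; for the assumed $AF_2$, all three arguments come out provisionally defeated, as in Example 2.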



Fig. 1. Two different chains (panels AF1 and AF2)



In the context of the preferred semantics, the notion of ‘defence’ is introduced 
by the following definitions: 

Definition 4. Given an argumentation framework $AF = \langle A, \to \rangle$, a set $S \subseteq A$
is conflict-free if and only if $\nexists \alpha, \beta \in S$ such that $\alpha \to \beta$.

Definition 5. Given an argumentation framework $AF = \langle A, \to \rangle$, a set $S \subseteq A$
is admissible if and only if S is conflict-free and $\forall \alpha \in S$, if $\exists \beta \in A$ such that
$\beta \to \alpha$ then $\exists \gamma \in S$ such that $\gamma \to \beta$.

Accordingly, a preferred extension is defined as a maximal set which is able
to defend all its elements:

Definition 6. Given an argumentation framework $AF = \langle A, \to \rangle$, a preferred
extension of AF is a maximal (with respect to set inclusion) admissible set $S \subseteq A$.
The set of preferred extensions will be denoted as PP(AF).

As shown in [4], PP(AF) is never empty, though there are cases where
PP(AF) = $\{\emptyset\}$, e.g. PP($AF_5$) in Example 4 below. In this semantics too, three
sets of arguments are identified: undefeated arguments belong to all extensions,
defeated arguments to none, while provisionally defeated arguments only to some of them.

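Definition 7. Given an argumentation framework $AF = \langle A, \to \rangle$, we define
the following three sets, forming a partition of A:

- $U_{\mathcal{P}}(AF) = \{\alpha \in A \mid \forall P \in PP(AF)\ \alpha \in P\}$

- $D_{\mathcal{P}}(AF) = \{\alpha \in A \mid \forall P \in PP(AF)\ \alpha \notin P\}$

- $P_{\mathcal{P}}(AF) = \{\alpha \in A \mid \exists P_1, P_2 \in PP(AF) : \alpha \in P_1 \wedge \alpha \notin P_2\}$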

Turning to Example 1 and Example 2, it is easy to see that the preferred
semantics gives the same outcome as the grounded semantics, since we have
that PP($AF_1$) = $\{\{\alpha, \gamma\}\}$, while PP($AF_2$) = $\{\{\alpha, \gamma\}, \{\beta\}\}$. The relation
between grounded and preferred semantics has been analyzed in [4]: in a nutshell,
the grounded semantics is more cautious than the preferred semantics, since for
all argumentation frameworks it turns out that all the arguments undefeated
and defeated according to the grounded semantics have the same status in the
preferred semantics. On the other hand, there may be arguments provisionally
defeated in the grounded semantics which are defeated or undefeated in the
preferred semantics, as in the case of floating arguments [5] exemplified below.

Example 3. With reference to the argumentation framework $AF_3$ shown in
Fig. 2, it is easy to see that, according to the grounded semantics, all arguments
are provisionally defeated. On the other hand, it turns out that PP($AF_3$) =
$\{\{\alpha, \delta\}, \{\beta, \delta\}\}$, therefore we have that $P_{\mathcal{P}}(AF_3) = \{\alpha, \beta\}$, $D_{\mathcal{P}}(AF_3) = \{\gamma\}$ and
$U_{\mathcal{P}}(AF_3) = \{\delta\}$.

In the example above, every preferred extension includes an argument which
attacks $\gamma$, while no argument attacking $\gamma$ belongs to all extensions: this is a case
of ‘floating defeat’, as it has been called in [8], which determines in turn the
‘floating acceptance’ of $\delta$. The inability to discriminate floating arguments is not
a specific disadvantage of grounded semantics, since Schlechta has proved in [9]
that it affects any possible single-status approach.
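
Definitions 4 to 7 translate into a brute-force procedure that enumerates all subsets, which is workable only for small frameworks. The sketch below is an illustration under stated assumptions rather than an efficient algorithm. The function names are hypothetical, and the attack relation used for the floating-argument framework (a and b attack each other, both attack c, and c attacks d) is reconstructed from the description of Example 3.

```python
from itertools import combinations


def conflict_free(subset, attacks):
    """Definition 4: no member of the set attacks another member (or itself)."""
    return not any((a, b) in attacks for a in subset for b in subset)


def admissible(subset, args, attacks):
    """Definition 5: conflict-free, and every attacker is counter-attacked."""
    if not conflict_free(subset, attacks):
        return False
    for a in subset:
        for b in args:
            if (b, a) in attacks and not any((c, b) in attacks for c in subset):
                return False
    return True


def preferred_extensions(args, attacks):
    """Definition 6: the admissible sets that are maximal under inclusion."""
    ordered = sorted(args)
    adm = [frozenset(c)
           for r in range(len(ordered) + 1)
           for c in combinations(ordered, r)
           if admissible(frozenset(c), ordered, attacks)]
    return {s for s in adm if not any(s < t for t in adm)}


def preferred_status(args, attacks):
    """Definition 7: (undefeated, defeated, provisionally defeated)."""
    exts = preferred_extensions(args, attacks)
    undefeated = {a for a in args if all(a in e for e in exts)}
    defeated = {a for a in args if all(a not in e for e in exts)}
    return undefeated, defeated, set(args) - undefeated - defeated


# Floating-argument framework of Example 3 (attack relation as described there).
af3_args = {"a", "b", "c", "d"}
af3_attacks = {("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("c", "d")}

af3_extensions = preferred_extensions(af3_args, af3_attacks)
af3_status = preferred_status(af3_args, af3_attacks)
```

On this framework the enumeration yields the two extensions {a, d} and {b, d}, so d is undefeated, c is defeated, and a and b are provisionally defeated.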

Fig. 2. Argumentation framework with a floating argument



3 Odd-Length Cycles: A Problem in Preferred Semantics

According to the definitions presented in the previous section, if the nodes of a
defeat graph are arranged in a cycle of attack relationships, then they are not
justified (i.e. they are provisionally defeated) according to both the grounded
and preferred semantics. This seems to be the intuitively right result, since all
arguments in a cycle should be treated equally for obvious symmetry reasons and
considering them all justified would yield a contradiction. However this result
is obtained in rather different ways in the two semantics. In the context of the
grounded semantics, all arguments forming a cycle simply turn out to belong to
$A_i$ if i is even and not to belong to $A_i$ if i is odd (see Definitions 2 and 3).

On the other hand, the preferred semantics features a sort of asymmetry,
since it treats odd-length cycles differently from even-length ones.

Example 4. Considering the argumentation framework $AF_4$ of Fig. 3, we have
that PP($AF_4$) = $\{\{\alpha\}, \{\beta\}\}$, therefore both arguments belong to $P_{\mathcal{P}}(AF_4)$.
With reference to the argumentation framework $AF_5$, Definition 6 identifies the
empty set as the unique preferred extension, therefore all the arguments belong
to $D_{\mathcal{P}}(AF_5)$. More generally, with odd-length cycles there is a unique empty
extension, while with even-length cycles non-empty extensions exist but their
intersection is empty.
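
The general claim about cycle length can be checked by brute force on small cycles. The sketch below is an illustrative check, not a proof: nodes are numbered 0 to n-1, node i attacks node i+1 modulo n, and the function names are assumptions.

```python
from itertools import combinations


def cycle(n):
    """Attack relation of the directed cycle 0 -> 1 -> ... -> n-1 -> 0."""
    return {(i, (i + 1) % n) for i in range(n)}


def preferred_extensions(n, attacks):
    """Maximal admissible subsets of {0, ..., n-1}, by exhaustive enumeration."""
    def admissible(s):
        conflict_free = not any((a, b) in attacks for a in s for b in s)
        defended = all(any((c, b) in attacks for c in s)
                       for a in s for b in range(n) if (b, a) in attacks)
        return conflict_free and defended

    adm = [frozenset(c)
           for r in range(n + 1)
           for c in combinations(range(n), r)
           if admissible(frozenset(c))]
    return {s for s in adm if not any(s < t for t in adm)}


# Preferred extensions of the cycles of length 2 to 7.
extensions_by_length = {n: preferred_extensions(n, cycle(n)) for n in range(2, 8)}
```

For these lengths, every odd cycle has the empty set as its only preferred extension, while every even cycle has two non-empty extensions (the even-numbered and the odd-numbered nodes) whose intersection is empty.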




Fig. 3. Even-length and odd-length cycles (panels AF4 and AF5)



The peculiar way of assigning a defeat status to odd-length cycles has recently
been indicated as “puzzling” by Pollock [10]. To our knowledge, however, this
difference has been considered a mere question of symmetry and elegance in
previous literature. We show in the following example that the different
treatment of odd-length cycles is a real problem since it gives rise to counter-intuitive
results.

Example 5. Considering the argumentation framework $AF_6$ shown in Fig. 4, it
turns out that PP($AF_6$) = $\{\{\alpha, \delta\}\}$, therefore $\alpha$ and $\delta$ are justified according
to the preferred semantics. By replacing the cycle $(\alpha, \beta, \gamma)$ with a two-length
cycle, we obtain the argumentation framework $AF_7$ whose arguments all belong
to $P_{\mathcal{P}}(AF_7)$ (and a similar result is obtained with any other even-length cycle).

In the example above $\alpha$ and $\delta$ emerge (unreasonably) undefeated, while all
nodes would be provisionally defeated with a similar graph encompassing an
even-length cycle. It does not seem acceptable that different results in
conceptually similar situations depend on the cycle length: symmetry reasons suggest
that all cycles should be treated equally. The difference arises because an
odd-length cycle has no extensions besides the empty one: as a consequence, there is
no extension where $\delta$ is out and $\gamma$ is in (such an extension would instead exist
with an even-length cycle). Since $\delta$ defeats $\gamma$, in this context $\alpha$ also survives.
Notice that a similar situation arises by replacing the three-length cycle with
any odd-length cycle: in a sense, odd-length cycles are in this case ‘weaker’ than
even-length cycles, since they are not able to prevent $\delta$ from being justified. The
opposite happens in the following example:

Example 6. With reference to the argumentation framework $AF_8$ shown in
Fig. 4, it turns out that PP($AF_8$) = $\{\{\delta_2\}\}$, therefore $\delta_2$ is undefeated while all
the other arguments are not justified. On the other hand, by replacing the
three-length cycle with an even-length cycle, we obtain an argumentation framework
whose arguments are all provisionally defeated.



Fig. 4. Two problematic argumentation frameworks for preferred semantics



In the example above, the absence of non-empty extensions for the three-length
cycle prevents the existence of extensions including $\delta_1$, leaving $\delta_2$ as the
only accepted argument, while this would not happen with an even-length cycle.
Notice that in this case odd-length cycles are ‘stronger’ than even-length cycles,
since the status of $\delta_1$ is the same as if it were attacked by an initial node. In
summary, we notice that besides being treated differently with respect to
even-length cycles, odd-length cycles exhibit anomalous behaviors: they change their
capability of defeating other arguments depending on the topology of the defeat
graph.



4 Exploiting the Notion of Maximal Conflict-Free Set

The analysis carried out in previous sections shows that neither the grounded
nor the preferred semantics is completely satisfactory and suggests the
following requirements for the study of an improved semantics, able to preserve the
advantages of both of them:

- it should discriminate floating arguments as the preferred semantics does;

- it should handle odd-length cycles in the same way as even-length cycles, as
the grounded semantics does;

- it should correctly handle the problematic examples shown above;

- it should not be more skeptical than the grounded semantics; in particular, it
should agree with it upon the status of undefeated and defeated arguments.

On the basis of the results provided by Schlechta [9], mentioned in Sect. 2,
the first requirement can only be satisfied by a multiple-status approach. Thus,
to satisfy the second requirement we look for a new notion of extension, able
to remove the anomalous treatment of odd-length cycles. After identifying our
candidate definition, we will check its properties concerning the third and fourth
requirements. To figure out a proper notion of extension, let us consider again
Example 4, in which we have recognized as anomalous the fact that $AF_5$ admits
the empty set as its unique extension. In order to reconcile the treatments of $AF_4$
and $AF_5$, we can look for the set $\mathcal{E}$ of non-empty extensions that can be admitted
for $AF_5$. First of all, we cannot tolerate contradictions in any extension, therefore
each extension has to include exactly one node. Moreover, all nodes should be
treated equally, therefore the only possibility for $\mathcal{E}$ is the set $\{\{\alpha\}, \{\beta\}, \{\gamma\}\}$. We
notice that $\mathcal{E}$ is made up of all conflict-free sets of $AF_5$ that are maximal, and
this suggests exploiting this notion as a basis for a new definition of extension.

Definition 8. Given an argumentation framework $AF = \langle A, \to \rangle$, we denote as
MI(AF) the set made up of the maximal (with respect to set inclusion)
conflict-free subsets of A.

The above intuition is confirmed by the fact that, by defining the set of
justified arguments as the intersection of all maximal conflict-free sets, the
problematic examples of the above section are handled correctly. In particular, with
reference to the argumentation framework $AF_6$ of Example 5, we have that
MI($AF_6$) = $\{\{\alpha, \delta\}, \{\gamma\}, \{\beta, \delta\}\}$, therefore all the arguments are provisionally
defeated, as prescribed by the grounded semantics.
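
The set of maximal conflict-free subsets of Definition 8 can again be enumerated by brute force on small frameworks. The sketch below is a minimal illustration; the function name `maximal_conflict_free` and the string labels are assumptions. It is run on a three-length cycle and on the chain of Example 1.

```python
from itertools import combinations


def maximal_conflict_free(args, attacks):
    """All conflict-free subsets of args that are maximal under inclusion."""
    ordered = sorted(args)
    cf = [frozenset(c)
          for r in range(len(ordered) + 1)
          for c in combinations(ordered, r)
          if not any((a, b) in attacks for a in c for b in c)]
    return {s for s in cf if not any(s < t for t in cf)}


# A three-length cycle a -> b -> c -> a.
cycle_sets = maximal_conflict_free(
    {"a", "b", "c"}, {("a", "b"), ("b", "c"), ("c", "a")})

# The chain a -> b -> c of Example 1.
chain_sets = maximal_conflict_free(
    {"a", "b", "c"}, {("a", "b"), ("b", "c")})

# Arguments belonging to every maximal conflict-free set of the chain.
chain_justified = frozenset.intersection(*chain_sets)
```

The cycle yields the three singletons, treating all nodes alike. The chain yields {a, c} and {b}, whose intersection is empty, so this notion alone would leave every argument of the chain provisionally defeated.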

Notice that Definition 8 is strictly weaker than Definition 6, since the absence
of conflicts is one of the conditions for admissibility. Actually, while this brings
about a correct handling of Example 4 and Example 5, it does not represent a
satisfactory solution: due to the increased number of extensions, it would
tend to assign the status of provisionally defeated to a large number of
arguments (often all of them). This happens, for instance, even for the argumentation
framework $AF_1$ of Example 1, where we have that MI($AF_1$) = $\{\{\alpha, \gamma\}, \{\beta\}\}$.
Notice that, in this case, the requirement of admissibility would have excluded,
among the elements of MI($AF_1$), the set $\{\beta\}$, yielding the intuitively correct
result. Thus, we are led to add some further condition to Definition 8, in
order to capture only a subset of the maximal conflict-free sets. In order to do
this, we draw inspiration from the way the defeat status can be computed
according to the grounded semantics. Considering again Example 1 as a simple
reference, computation basically proceeds from the frontier of the defeat graph
towards the inside: the initial node $\alpha$ is assigned the status of undefeated,
causing $\beta$, which is attacked by $\alpha$, to be assigned the status of defeated, and this
in turn causes $\gamma$, whose unique defeater $\beta$ is defeated, to be assigned the status
of undefeated. Thus, the set $\{\beta\}$ is rejected in this computation schema, which
is therefore a promising candidate as a way to identify the extensions among
maximal conflict-free sets.

In order to refine this intuition, let us consider again Example 3: according
to the first requirement stated above, our approach should capture exactly the
preferred extensions $P_1 = \{\alpha, \delta\}$ and $P_2 = \{\beta, \delta\}$. Starting from the frontier of
the graph, the construction of these extensions might proceed according to the
following steps:

1. Consider the subgraph involving $\{\alpha, \beta\}$, and identify the relevant maximal
conflict-free sets $P'_1 = \{\alpha\}$ and $P'_2 = \{\beta\}$;

2. Consider then node $\gamma$ for possible additions to the sets identified in the
previous step: notice that $P'_1$ includes the defeater $\alpha$ of $\gamma$, therefore $\gamma$ cannot
be added to $P'_1$. For the same reason, $\gamma$ cannot be added to $P'_2$ either;

3. Consider node $\delta$: it can be added to $P'_1$ obtaining the extension $P_1$, since its
unique defeater $\gamma$ has not been added to $P'_1$. In the same way, we obtain $P_2$ as
$P'_2 \cup \{\delta\}$.

Notice that, in steps 1-3, we have considered the strongly connected components
of the defeat graph, i.e. $\{\alpha, \beta\}$, $\{\gamma\}$ and $\{\delta\}$, respectively. In a sense, we have
generalized the defeat status computation prescribed by the grounded semantics,
by considering strongly connected components instead of single nodes. In
particular, the extensions have been constructed by completing maximal conflict-free
sets in an incremental way, starting from the frontier of the graph and
proceeding towards the interior. In order to proceed with this analysis in more formal
terms, let us introduce the following definitions.

Definition 9. Given an argumentation framework $AF = \langle A, \to \rangle$, two nodes
$\alpha, \beta \in A$ are path-equivalent iff either $\alpha = \beta$ or there is a path from $\alpha$ to $\beta$ and
a path from $\beta$ to $\alpha$. The strongly connected components of AF are the equivalence
classes of vertices under the relation of path-equivalence. The set of the strongly
connected components of AF is denoted as SCC(AF).

Definition 10. Given an argumentation framework $AF = \langle A, \to \rangle$ and a
strongly connected component $S \in SCC(AF)$,
parents(S) = $\{P \in SCC(AF) \mid P \neq S \wedge \exists \alpha \in P, \beta \in S : \alpha \to \beta\}$, and
parents*(S) = $\{\alpha \in A \mid \alpha \notin S \wedge \exists \beta \in S : \alpha \to \beta\}$.

Definition 11. Let $AF = \langle A, \to \rangle$ be an argumentation framework, and let S
be a set $S \subseteq A$. The restriction of AF to S is the argumentation framework
$AF{\downarrow}_S = \langle S, \to \cap\, (S \times S) \rangle$.

Let SG be the graph obtained by considering strongly connected components
as single nodes, i.e. $SG = \langle SCC(AF), R^* \rangle$ where $(S_i, S_j) \in R^*$ iff
$S_i \in$ parents($S_j$). It is easy to see that SG is acyclic: this justifies the idea
of computing a particular extension E from the frontier towards the inside of
the defeat graph. Basically, we start from the strongly connected components
$S_i$ that are initial in SG, including in E a maximal conflict-free set of $AF{\downarrow}_{S_i}$
for each $S_i$. Then, we proceed by considering an $S \in SCC(AF)$ such that every
$P \in$ parents(S) is initial. Of course, E should not include those nodes of S that
are attacked by nodes previously included in E. The question is how to proceed
with the set $S^U$ made up of the other nodes of S. If there is just a single node
in $S^U$, the indications provided by Example 1 suggest including it in E. If, on
the other hand, $|S^U| > 1$, a tentative solution would be to include in E a maximal
independent set of $S^U$. However, a simple example reveals that this option
does not constrain enough the set of the extensions that can be identified from
maximal conflict-free sets.

Example 7. Considering the argumentation framework $AF_9$ shown in Fig. 5, we
have that $SCC(AF_9) = \{S_1, S_2\}$, where $S_1 = \{\alpha\}$ and
$S_2 = \{\beta_1, \beta_2, \beta_3, \beta_4\}$. $S_1$ is initial, and its unique maximal
conflict-free set is $\{\alpha\}$ itself. This in turn excludes $\beta_1$ from all the
extensions, leading us to select a maximal conflict-free set of the subgraph
$AF_9{\downarrow}_{S_2 \setminus \{\beta_1\}}$. It turns out that
$\mathcal{MI}(AF_9{\downarrow}_{S_2 \setminus \{\beta_1\}}) = \{\{\beta_2, \beta_4\}, \{\beta_3\}\}$,
therefore we get the two extensions $E_1 = \{\alpha, \beta_2, \beta_4\}$ and
$E_2 = \{\alpha, \beta_3\}$, yielding $\alpha$ undefeated, $\beta_1$ defeated and
$\beta_2, \beta_3, \beta_4$ provisionally defeated. However, in order to get the same
outcome as the grounded (and preferred) semantics, only $E_1$ should be
identified as an extension, while $E_2$ should be excluded.



Fig. 5. An example supporting a recursive definition of extensions



In order to overcome this difficulty, in the example above $S_2 \setminus \{\beta_1\}$ should
be treated in the same way as an ordinary graph, i.e. proceeding again from the
frontier towards the inside. This suggests the alternative option that we choose,
i.e. to define extensions recursively.

Definition 12. Given an argumentation framework $AF = \langle A, \rightarrow \rangle$, a set
$E \subseteq A$ and a strongly connected component $S \in SCC(AF)$, we define:

- $S^D(E) = \{\alpha \in S \mid \exists \beta \in parents^*(S) : \beta \in E \wedge \beta \rightarrow \alpha\}$

- $S^U(E) = S \setminus S^D(E)$

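In our proposal, the set of extensions, denoted as $\mathcal{EM}(AF)$, is defined as
follows:

Definition 13. Given an argumentation framework $AF = \langle A, \rightarrow \rangle$ and a set
$E \subseteq A$, we have that $E \in \mathcal{EM}(AF)$ iff $\forall S \in SCC(AF)$

1. $S^D(E) \cap E = \emptyset$; and

2. $(S^U(E) \cap E) \in \mathcal{MI}(AF{\downarrow}_{S^U(E)})$ if $|SCC(AF{\downarrow}_{S^U(E)})| = 1$, and
   $(S^U(E) \cap E) \in \mathcal{EM}(AF{\downarrow}_{S^U(E)})$ otherwise.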

Following the usual multiple-status approach, the defeat status of arguments
is identified by the sets $U_{\mathcal{M}}(AF)$, $D_{\mathcal{M}}(AF)$, $P_{\mathcal{M}}(AF)$, defined as
in Definition 7 with reference to $\mathcal{EM}(AF)$ instead of $\mathcal{EP}(AF)$. In order to
better understand Definition 13, let us show that, unlike the preferred
semantics, it gives the right outcome for Example 6.

Example 8. We have that $SCC(AF_8) = \{S_1, S_2\}$, where $S_1 = \{\alpha, \beta, \gamma\}$ and
$S_2 = \{\delta_1, \delta_2\}$. Taking into account Definition 12 and the fact that
$parents(S_1) = \emptyset$, for any E, $S_1^D(E) = \emptyset$ and $S_1^U(E) = S_1$. Thus, from
Definition 13 a generic extension E must satisfy
$(S_1 \cap E) \in \mathcal{MI}(AF_8{\downarrow}_{S_1^U(E)})$, i.e.
$(S_1 \cap E) \in \{\{\alpha\}, \{\beta\}, \{\gamma\}\}$. In case $(S_1 \cap E) = \{\alpha\}$, taking into
account that $parents^*(S_2) = \{\gamma\}$ we get $S_2^D(E) = \emptyset$ and
$(S_2 \cap E) \in \{\{\delta_1\}, \{\delta_2\}\}$: thus, we identify the extensions
$E_1 = \{\alpha, \delta_1\}$ and $E_2 = \{\alpha, \delta_2\}$. Reasoning in a similar way in the case
$(S_1 \cap E) = \{\beta\}$, we identify the extensions $E_3 = \{\beta, \delta_1\}$ and
$E_4 = \{\beta, \delta_2\}$. Finally, if $(S_1 \cap E) = \{\gamma\}$ then $S_2^D(E) = \{\delta_1\}$,
entailing by the first point of Definition 13 that $\delta_1 \notin E$. Moreover,
$S_2^U(E) = \{\delta_2\}$, yielding $\delta_2 \in E$ and thus identifying the extension
$E_5 = \{\gamma, \delta_2\}$. In sum, $\mathcal{EM}(AF_8) = \{E_1, E_2, E_3, E_4, E_5\}$, therefore all
the arguments are provisionally defeated.
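
The computation in Example 8 can be reproduced with a brute-force sketch of Definition 13. This is an illustrative reading of the definition rather than the authors' implementation: it enumerates all subsets, so it only suits very small frameworks. The figure for $AF_8$ is not reproduced here, so the attack relation below is an assumption consistent with the example (a three-node cycle over alpha, beta, gamma with an assumed orientation, gamma attacking delta1, and delta1 and delta2 attacking each other). All function names are hypothetical.

```python
from itertools import combinations

def sccs(nodes, attacks):
    """Strongly connected components via mutual reachability (small graphs only)."""
    def reach(start):
        seen, stack = {start}, [start]
        while stack:
            current = stack.pop()
            for (a, b) in attacks:
                if a == current and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen
    r = {n: reach(n) for n in nodes}
    return {frozenset(m for m in nodes if m in r[n] and n in r[m]) for n in nodes}

def subsets(nodes):
    items = sorted(nodes)
    return [frozenset(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def maximal_conflict_free(nodes, attacks):
    """All conflict-free subsets of `nodes` that are maximal w.r.t. inclusion."""
    free = [s for s in subsets(nodes)
            if not any((a, b) in attacks for a in s for b in s)]
    return {s for s in free if not any(s < t for t in free)}

def is_extension(E, nodes, attacks):
    """Membership test following the two conditions of Definition 13."""
    E = frozenset(E)
    for S in sccs(nodes, attacks):
        # S^D(E): nodes of S attacked by a member of E outside S (Definition 12).
        s_d = {b for (a, b) in attacks if b in S and a not in S and a in E}
        s_u = S - s_d
        if s_d & E:                                  # condition 1 violated
            return False
        sub = {(a, b) for (a, b) in attacks if a in s_u and b in s_u}
        part = E & s_u
        if len(sccs(s_u, sub)) == 1:                 # condition 2, base case
            if part not in maximal_conflict_free(s_u, sub):
                return False
        elif not is_extension(part, s_u, sub):       # condition 2, recursive case
            return False
    return True

def extensions(nodes, attacks):
    return {E for E in subsets(nodes) if is_extension(E, nodes, attacks)}

# Assumed attack relation for AF_8 (see the lead-in).
nodes = {"alpha", "beta", "gamma", "delta1", "delta2"}
attacks = {("alpha", "beta"), ("beta", "gamma"), ("gamma", "alpha"),
           ("gamma", "delta1"), ("delta1", "delta2"), ("delta2", "delta1")}
result = extensions(nodes, attacks)
```

Under these assumptions `result` contains exactly the five extensions $E_1, \ldots, E_5$ listed in Example 8. The recursion terminates because a restricted framework is either strictly smaller than the original or consists of a single component.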

It can be seen that all other examples considered above are handled correctly 
by the proposed semantics. In particular, in the argumentation frameworks AFi, 
AF2, AFy and AFg where preferred and grounded semantics agree, our semantics 
gives the same results (EA 4 () = TV{) in all cases). Also in the argumentation 
framework AF3 iFAd(AF3) = TV{KF^), therefore our semantics correctly agrees 
with preferred semantics. Finally, TM{AFq) = {{a, <5|, {/3, (5|, {7}}, therefore 
our semantics agrees with grounded semantics as desired. 

5 Relationships with Grounded and Preferred Semantics 

After having validated our proposal by means of examples, in this section we
show that it maintains some relationships with both the grounded and preferred
semantics. First, we consider the fourth requirement stated in the previous
section, i.e. the agreement with the grounded semantics upon the status of
undefeated and defeated arguments (proofs are not given due to space
limitations). This result relies on two properties of the grounded semantics,
which relate the defeat status assignment prescribed by the grounded semantics
to the strongly connected components of the defeat graph. In particular, the
first considers a non-trivial strongly connected component whose external
attackers (if any) are all defeated or provisionally defeated, establishing that,
in this case, all of its nodes are provisionally defeated.

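Proposition 1. Given $AF = \langle A, \rightarrow \rangle$, let $S \in SCC(AF)$ be such that $|S| > 1$
and $\forall \gamma \in parents^*(S)$, $\gamma \in (D_{\mathcal{G}}(AF) \cup P_{\mathcal{G}}(AF))$. Then, $S \subseteq P_{\mathcal{G}}(AF)$.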

The subsequent property introduces two subsets related to $S^D(E)$ and $S^U(E)$
in Definition 12, and establishes some relationships between the status assigned
by the grounded semantics to their nodes and the constraints on E stated in
Definition 12.

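Proposition 2. Let us consider $AF = \langle A, \rightarrow \rangle$ and $S \in SCC(AF)$. Let
$S^D \subseteq S$ be a subset of S such that

1. $S^D \supseteq \{\alpha \in S \mid \exists \beta \in parents^*(S), \beta \rightarrow \alpha, \beta \in U_{\mathcal{G}}(AF)\}$; and

2. $S^D \subseteq \{\alpha \in S \mid \exists \beta \in parents^*(S), \beta \rightarrow \alpha, \beta \in (U_{\mathcal{G}}(AF) \cup P_{\mathcal{G}}(AF))\}$

and let $S^U = S \setminus S^D$. Then, we have that
$\forall \alpha \in S^D$, $\alpha \in (D_{\mathcal{G}}(AF) \cup P_{\mathcal{G}}(AF))$, and $\forall \gamma \in S^U$:

- if $\gamma \in U_{\mathcal{G}}(AF)$, then $\gamma \in U_{\mathcal{G}}(AF{\downarrow}_{S^U})$;

- if $\gamma \in D_{\mathcal{G}}(AF)$, then $\gamma \in D_{\mathcal{G}}(AF{\downarrow}_{S^U})$.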

The following theorem, exploiting the above properties, proves the agreement 
with grounded semantics upon the status of defeated and undefeated arguments. 

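Theorem 1. Given $AF = \langle A, \rightarrow \rangle$, we have that $\forall E \in \mathcal{EM}(AF)$,
$U_{\mathcal{G}}(AF) \subseteq E \wedge D_{\mathcal{G}}(AF) \subseteq (A \setminus E)$.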

Proof (Sketch). Referring to Definition 13, we consider a generic A G AAl(AF), 
and we assume recursively that, VS^ G SCC(AF), 

- VA G A7W(AF;sc/(b)) Ug(AF) C A 

- VA G FM{AFis}{E)) De(AF) C (A\ A) 

Then, reasoning by induction on the strongly connected components of AF, we
prove that $U_{\mathcal{G}}(AF) \subseteq E$ and that $D_{\mathcal{G}}(AF) \subseteq (A \setminus E)$. In particular,
Proposition 1 is exploited to show that, if $|SCC(AF{\downarrow}_{S^U(E)})| = 1$, then all the
nodes of $S^U(E)$ are provisionally defeated, so that there is nothing to prove for
them (this happens for instance for initial strongly connected components). On
the other hand, the main roles of Proposition 2 concern the first point of
Definition 13 and the case $|SCC(AF{\downarrow}_{S^U(E)})| > 1$, where it is exploited to
prove that if $S^U(E) \cap E \in \mathcal{EM}(AF{\downarrow}_{S^U(E)})$, then the claim is satisfied for
nodes of $S^U(E)$.

As far as preferred semantics is concerned, given a generic $AF = \langle A, \rightarrow \rangle$
it is possible to prove that any preferred extension is included in one of our
extensions, i.e. $\forall E \in \mathcal{EP}(AF)\ \exists E' \in \mathcal{EM}(AF) : E \subseteq E'$. This, in turn,
entails that preferred semantics agrees upon the status of arguments that are
defeated according to ours.
6 Conclusions

In this paper, we have proposed a novel argumentation semantics that, while
maintaining the same capability of discriminating floating arguments as preferred
semantics, correctly deals with the semantic problems arising from odd-length
cycles and satisfies a set of intuitively appealing requirements. The symmetry
assumptions that underlie our work are related to an interpretation of
argumentation as a framework for defeasible reasoning, following e.g. [1], while
in other approaches that consider argumentation as a branch of dialogue [11] it
may be the case that a different treatment of odd and even-length cycles is
appropriate.

Our proposal builds on intuitions coming from both grounded and preferred 
semantics and, in a sense, combines the advantages of both of them, agreeing in 
several problematic examples with that among the two semantics which is closer 
to intuition. As for future work, we will investigate the relationship between our 
semantics and the notions of attack and defence lying at the heart of preferred 
semantics. 



Acknowledgments. We thank the referees for their helpful comments. 

References 

1. Pollock, J.L.: How to reason defeasibly. Artificial Intelligence 57 (1992) 1-42 

2. Parsons, S., Sierra, C., Jennings, N.: Agents that reason and negotiate by arguing. 
Journal of Logic and Computation 8 (1998) 261-292 

3. Baroni, P., Giacomin, M., Guida, G.: Extending abstract argumentation systems 
theory. Artificial Intelligence 120 (2000) 251-270 

4. Dung, P.M.: On the acceptability of arguments and its fundamental role in non- 
monotonic reasoning, logic programming, and n-person games. Artificial Intelli- 
gence 77 (1995) 321-357 

5. Prakken, H., Vreeswijk, G.A.W.: Logics for defeasible argumentation. In Gab- 
bay, D.M., Guenthner, F., eds.: Handbook of Philosophical Logic, Second Edition. 
Kluwer Academic Publishers, Dordrecht (2001) 

6. Pollock, J.L.: Justification and defeat. Artificial Intelligence 67 (1994) 377-407 

7. Vreeswijk, G.A.W.: Abstract argumentation systems. Artificial Intelligence 90 
(1997) 225-279 

8. Makinson, D., Schlechta, K.: Floating conclusions and zombie paths: Two deep
difficulties in the ‘directly skeptical’ approach to defeasible inheritance networks.
Artificial Intelligence 48 (1991) 199-209

9. Schlechta, K.: Directly sceptical inheritance cannot capture the intersection of 
extensions. Journal of Logic and Computation 3 (1993) 455-467 

10. Pollock, J.L.: Defeasible reasoning with variable degrees of justification. Artificial 
Intelligence 133 (2001) 233-282 

11. Walton, D., Krabbe, E.: Commitment in Dialogue: Basic concept of interpersonal 
reasoning. State University of New York Press, Albany NY (1995) 




On the Relation between Reiter’s Default Logic and Its (Major) Variants

James P. Delgrande¹ and Torsten Schaub²

¹ School of Computing Science, Simon Fraser University, Burnaby, B.C., Canada V5A 1S6
jim@cs.sfu.ca

² Institut für Informatik, Universität Potsdam, D-14415 Potsdam, Germany
torsten@cs.uni-potsdam.de



Abstract. Default logic is one of the best known and most studied of the ap- 
proaches to nonmonotonic reasoning. Subsequently, several variants of default 
logic have been proposed to give systems with properties differing from the orig- 
inal. In this paper we show that these variants are in a sense superfluous, in that 
for any of these variants of default logic, we can exactly mimic the behaviour of a 
variant in standard default logic. We accomplish this by translating a default the- 
ory under a variant interpretation into a second default theory wherein the variant 
interpretation is respected. 



1 Introduction 

Default logic [17] is one of the best known approaches to nonmonotonic reasoning. In
this approach, classical logic is augmented by default rules of the form
$\frac{\alpha : \beta_1, \ldots, \beta_n}{\gamma}$. Such a rule is informally interpreted as “if $\alpha$ is true, and
$\beta_1, \ldots, \beta_n$ are consistent with what is known, then conclude $\gamma$ by default”. The
meaning of a rule then rests on notions of provability and consistency with respect to
a given set of beliefs. A set of beliefs sanctioned by a set of default rules, with respect
to an initial set of facts, is called an extension of this set of facts.

However, the very generality of default logic means that it lacks several important 
properties, including existence of extensions [17] and cumulativity [12]. In addition,
differing intuitions concerning the role of default rules have led to differing opinions 
concerning other properties, including semi-monotonicity [17] and commitment to as- 
sumptions [16]. As a result, a number of modifications to the definition of a default 
extension have been proposed, resulting in a number of variants of default logic. Most 
notably these variants include constrained default logic [18,3], cumulative default logic 
[1], justified default logic [11], and rational default logic [15]. In each of these variants, 
the definition of an extension is modified, and a system with properties differing from 
the original is obtained. 

In this paper we show that these variants are in a sense superfluous, in that each variant
can be expressed within the framework of (the original) default logic. To accomplish this,
we make use of translations mapping a default theory under a “variant” interpretation
onto a second theory under the interpretation of the original approach, such that the
respectively resulting extensions are in a one-to-one correspondence. In the case of
variant default logics that use the language of classical logic, we extend the language
with labelled formulas. In the case of an assertional default logic, such as cumulative
default logic, the situation is more complex since cumulative default logic makes use of
“assertions,” which extend the language of classical logic. Here we appeal to a quotation
operator with which we can effectively name formulas; we then make assertions concerning
quoted formulas by means of introduced predicates.

Hence we provide a unification of default logics, in that we show that the original
formulation of default logic is expressive enough to subsume its variants. The reverse
relation does not hold for constrained, justified, or cumulative default logic, in that
one cannot express default logic in terms of these variants. However, rational default
logic can be embedded in Reiter default logic, and vice versa. The translations that we
provide show, in a precise sense, how each variant relates to standard default logic. As
well, the approach lends some insight into characteristics of standard default theories.
For example, our translations implicitly provide specific characterisations of default
theories that are guaranteed to have extensions or are guaranteed to be semi-monotonic.
That is, since we map variant default logics into default logic, the theories in the image
of the mapping are guaranteed to retain properties of the original variant.

2 Default Logic and Its Variants 

Default logic [17] augments classical logic by default rules of the form
$\frac{\alpha : \beta_1, \ldots, \beta_n}{\gamma}$, where the constituent elements are formulas of classical
propositional or first-order logic. Defaults with unbound variables are taken to stand
for all corresponding instances. For simplicity, we deal just with singular defaults for
which n = 1. A singular rule is normal if $\beta$ is equivalent to $\gamma$; it is semi-normal if
$\beta$ implies $\gamma$. As regards standard default logic, [9] shows that any default rule can
be transformed into a set of semi-normal defaults; similarly in constrained and rational
default logic multiple justifications can be replaced by their conjunction. Moreover the
great majority of applications use only semi-normal defaults, so the above assumption
is a reasonable restriction. We denote the prerequisite $\alpha$ of a default
$\delta = \frac{\alpha : \beta}{\gamma}$ by $Prereq(\delta)$, its justification $\beta$ by $Justif(\delta)$ and its
consequent $\gamma$ by $Conseq(\delta)$. Conversely, to ease notation, in Sect. 3 we rely on a
function $\delta$ to obtain the default rule in which a given prerequisite, justification, or
consequent occurs, respectively. That is, for instance, $\delta(Prereq(\delta)) = \delta$. Moreover,
to simplify the technical results, we presuppose without loss of generality that default
rules have unique components. To avoid confusion we will use the term default logic
to refer solely to Reiter’s original system. Variants will be referred to as constrained
(cumulative, justified, etc.) default logic. Similar considerations apply to the notions of
default extension.

A set of default rules D and a set of formulas W form a default theory (D, W) that
may induce zero, one, or multiple extensions in the following way.

Definition 1 ([17]). Let (D, W) be a default theory. For any set S of formulas, let $\Gamma(S)$
be the smallest set of formulas such that

1. $W \subseteq \Gamma(S)$,

2. $\Gamma(S) = Th(\Gamma(S))$,

3. if $\frac{\alpha : \beta}{\gamma} \in D$ and $\alpha \in \Gamma(S)$ and $S \cup \{\beta\} \nvdash \bot$ then $\gamma \in \Gamma(S)$.

A set of formulas E is an extension of (D, W) iff $\Gamma(E) = E$.

That is, E is a fixed point of $\Gamma$. Any such extension represents a possible set of beliefs
about the world at hand. For illustration, consider the default theories

\[ (D_1, W_1) = \left( \left\{ \frac{: B}{C}, \frac{: \neg B}{D} \right\}, \emptyset \right) ; \qquad (1) \]

\[ (D_2, W_2) = \left( \left\{ \frac{: B}{C}, \frac{: \neg C}{D} \right\}, \emptyset \right) . \qquad (2) \]

While $(D_1, W_1)$ admits one extension, $Th(\{C, D\})$, the only extension of $(D_2, W_2)$ is
$Th(\{C\})$. In the literature $(D_1, W_1)$ is often used as an illustrative example for what is
sometimes referred to as commitment to assumptions [16] (or: regularity [4]); similarly
$(D_2, W_2)$ illustrates semi-monotonicity [17].
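
The fixed-point condition of Definition 1 can be checked mechanically on these two theories with a guess-and-check sketch. It rests on a strong simplifying assumption that is not part of the paper: every prerequisite, justification and consequent is a conjunction of literals, so a deductively closed set can be represented by its literals and consistency reduces to the absence of complementary pairs. Literals are strings such as "B" and "-B"; a default is a hypothetical triple (prerequisite, justification, consequent) of literal sets.

```python
from itertools import combinations

def neg(lit):
    """Complement of a literal written as 'p' or '-p'."""
    return lit[1:] if lit.startswith("-") else "-" + lit

def consistent(lits):
    """A set of literals is consistent iff it holds no complementary pair."""
    return not any(neg(l) in lits for l in lits)

def gamma(defaults, facts, candidate):
    """Least literal set containing the facts and closed under the defaults
    whose justification is consistent with `candidate` (Definition 1)."""
    out = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, just, cons in defaults:
            if pre <= out and consistent(candidate | just) and not cons <= out:
                out |= cons
                changed = True
    return frozenset(out)

def extensions(defaults, facts=frozenset()):
    """All consistent candidates E with gamma(E) == E."""
    pool = sorted(set(facts).union(*(cons for _, _, cons in defaults)))
    return [cand
            for k in range(len(pool) + 1)
            for cand in map(frozenset, combinations(pool, k))
            if consistent(cand) and gamma(defaults, facts, cand) == cand]

F = frozenset
D1 = [(F(), F({"B"}), F({"C"})), (F(), F({"-B"}), F({"D"}))]   # theory (1)
D2 = [(F(), F({"B"}), F({"C"})), (F(), F({"-C"}), F({"D"}))]   # theory (2)
ext1 = extensions(D1)
ext2 = extensions(D2)
```

In this restricted setting `ext1` holds the single literal set {C, D} and `ext2` the single set {C}, matching the extensions $Th(\{C, D\})$ and $Th(\{C\})$ stated above.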

Lukaszewicz [11] modifies default logic by attaching constraints to extensions in 
order to strengthen the applicability condition of default rules. A justified extension 
(called a modified extension in [11]) is defined as follows. 

Definition 2 ([11]). Let (D, W) be a default theory. For any pair of sets of formulas
(S, T) let $\Gamma(S, T)$ be the pair of smallest sets of formulas S', T' such that

1. $W \subseteq S'$,

2. $Th(S') = S'$,

3. for any $\frac{\alpha : \beta}{\gamma} \in D$, if $\alpha \in S'$ and $S \cup \{\gamma\} \cup \{\eta\} \nvdash \bot$ for every
   $\eta \in T \cup \{\beta\}$ then $\gamma \in S'$ and $\beta \in T'$.

A set of formulas E is a justified extension of (D, W) for a set of formulas J iff
$\Gamma(E, J) = (E, J)$.

So a default rule applies if all justifications of other applying default rules are
consistent with the considered extension E and $\gamma$, and if additionally $\gamma$ and $\beta$ are
consistent with E. The set of justifications J need not be deductively closed nor
consistent.

In our examples, $(D_1, W_1)$ has one justified extension, containing C and D. However,
theory $(D_2, W_2)$ has two justified extensions, one with C and one containing D.

In [18,3] constrained default logic is defined. The central idea is that the justifications 
and consequents of a default rule jointly provide a context or assumption set for default 
rule application. The definition of a constrained extension is as follows. 

Definition 3 ([3]). Let (D, W) be a default theory. For any set of formulas T, let $\Gamma(T)$
be the pair of smallest sets of formulas (S', T') such that

1. $W \subseteq S' \subseteq T'$,

2. $S' = Th(S')$ and $T' = Th(T')$,

3. for any $\frac{\alpha : \beta}{\gamma} \in D$, if $\alpha \in S'$ and $T \cup \{\beta\} \cup \{\gamma\} \nvdash \bot$ then
   $\gamma \in S'$ and $\beta \wedge \gamma \in T'$.

A pair of sets of formulas (E, C) is a constrained extension of (D, W) iff
$\Gamma(C) = (E, C)$.
Unlike Lukaszewicz’s approach, the contextual information is here a deductively closed
superset of the actual extension.

In our example, $(D_1, W_1)$ has two constrained extensions, one containing C and
another including D. Also, theory $(D_2, W_2)$ has two constrained extensions, one with
C and one with D.
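
Definition 3 can be explored on the same two theories with a sketch analogous to a fixed-point search over candidate constraint sets. As a simplifying assumption not made in the paper, all formulas are conjunctions of literals (strings such as "B" and "-B"), so closed sets are represented by their literals. The triple encoding of defaults and the function names are hypothetical.

```python
from itertools import combinations

def neg(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def consistent(lits):
    return not any(neg(l) in lits for l in lits)

def gamma_c(defaults, facts, T):
    """Pair (S', T') of Definition 3 for a candidate constraint set T."""
    S, Tp = set(facts), set(facts)
    changed = True
    while changed:
        changed = False
        for pre, just, cons in defaults:
            applicable = pre <= S and consistent(T | just | cons)
            already_in = cons <= S and (just | cons) <= Tp
            if applicable and not already_in:
                S |= cons            # consequent joins the extension
                Tp |= just | cons    # justification and consequent join the constraints
                changed = True
    return frozenset(S), frozenset(Tp)

def constrained_extensions(defaults, facts=frozenset()):
    """All pairs (E, C) with gamma_c(C) == (E, C), over consistent literal sets C."""
    pool = sorted(set(facts).union(*(j | c for _, j, c in defaults)))
    found = []
    for k in range(len(pool) + 1):
        for T in map(frozenset, combinations(pool, k)):
            if consistent(T):
                E, C = gamma_c(defaults, facts, T)
                if C == T:
                    found.append((E, C))
    return found

F = frozenset
D1 = [(F(), F({"B"}), F({"C"})), (F(), F({"-B"}), F({"D"}))]   # theory (1)
D2 = [(F(), F({"B"}), F({"C"})), (F(), F({"-C"}), F({"D"}))]   # theory (2)
ce1 = set(constrained_extensions(D1))
ce2 = set(constrained_extensions(D2))
```

Both theories yield two pairs (E, C) in this setting, one whose extension part contains C and one whose extension part contains D, in line with the text.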

The following is an alternative characterisation of rational extensions, originally
proposed in [14], given in [10]:

Definition 4 ([14]). Let (D, W) be a default theory. For any set of formulas T let $\Gamma(T)$
be the pair of smallest sets of formulas (S', T') such that

1. $W \subseteq S' \subseteq T'$,

2. $S' = Th(S')$ and $T' = Th(T')$,

3. for any $\frac{\alpha : \beta}{\gamma} \in D$, if $\alpha \in S'$ and $T \cup \{\beta\} \nvdash \bot$ then $\gamma \in S'$ and
   $\beta \wedge \gamma \in T'$.

A pair of sets of formulas (E, C) is a rational extension of (D, W) iff $\Gamma(C) = (E, C)$.

This definition is the same as that of constrained default logic, except for the consistency
check. As with constrained default logic, $(D_1, W_1)$ has two rational extensions, one
containing C and one including D. However, theory $(D_2, W_2)$ has only one rational
extension with C.

Brewka [1] describes a variant of default logic where the applicability condition 
for default rules is strengthened, and the justification for adopting a default conclusion 
is made explicit. In order to keep track of implicit assumptions, Brewka introduces 
assertions, or formulas labeled with the set of justifications and consequents of the default 
rules which were used for deriving them. Intuitively, assertions represent formulas along 
with the reasons for believing them. 

Definition 5 ([1]). Let $\alpha, \gamma_1, \ldots, \gamma_m$ be formulas. An assertion $\xi$ is any expression of
the form $\langle \alpha, \{\gamma_1, \ldots, \gamma_m\} \rangle$, where $\alpha = Form(\xi)$ is called the asserted formula
and the set $\{\gamma_1, \ldots, \gamma_m\} = Supp(\xi)$ is called the support of $\alpha$.*

To correctly propagate the supports, the classical inference relation is extended as
follows.

Definition 6 ([1]). Let $\mathcal{S}$ be a set of assertions. Then $\widehat{Th}(\mathcal{S})$, the assertional
consequence closure operator, is the smallest set of assertions such that

1. $\mathcal{S} \subseteq \widehat{Th}(\mathcal{S})$,

2. if $\xi_1, \ldots, \xi_n \in \widehat{Th}(\mathcal{S})$ and $Form(\xi_1), \ldots, Form(\xi_n) \vdash \gamma$ then
   $\langle \gamma, Supp(\xi_1) \cup \cdots \cup Supp(\xi_n) \rangle \in \widehat{Th}(\mathcal{S})$.

An assertional default theory is a pair $(D, \mathcal{W})$, where D is a set of default rules and
$\mathcal{W}$ is a set of assertions. An assertional extension is defined as follows.

Definition 7 ([1]). Let $(D, \mathcal{W})$ be an assertional default theory. For any set of
assertions $\mathcal{S}$ let $\Gamma(\mathcal{S})$ be the smallest set of assertions $\mathcal{S}'$ such that

1. $\mathcal{W} \subseteq \mathcal{S}'$,

2. $\widehat{Th}(\mathcal{S}') = \mathcal{S}'$,

3. for any $\frac{\alpha : \beta}{\gamma} \in D$, if $\langle \alpha, Supp(\alpha) \rangle \in \mathcal{S}'$ and
   $Form(\mathcal{S}) \cup Supp(\mathcal{S}) \cup \{\beta\} \cup \{\gamma\} \nvdash \bot$
   then $\langle \gamma, Supp(\alpha) \cup \{\beta\} \cup \{\gamma\} \rangle \in \mathcal{S}'$.

A set of assertions $\mathcal{E}$ is an assertional extension of $(D, \mathcal{W})$ iff $\Gamma(\mathcal{E}) = \mathcal{E}$.

* The two projections extend to sets of assertions in the obvious way. We sometimes misuse Supp
for denoting the support of an asserted formula, e.g. $\langle \alpha, Supp(\alpha) \rangle$.

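For illustration, consider the assertional default theory (often used for illustrating
the failure of cumulativity [12])

\[ (D_3, \mathcal{W}_3) = \left( \left\{ \frac{: A}{A}, \frac{A \vee B : \neg A}{\neg A} \right\}, \emptyset \right) . \qquad (3) \]

This theory has one assertional extension, including $\langle A, \{A\} \rangle$ as well as
$\langle A \vee B, \{A\} \rangle$. Adding the latter assertion to the set of assertional facts yields the
assertional default theory

\[ (D_4, \mathcal{W}_4) = \left( \left\{ \frac{: A}{A}, \frac{A \vee B : \neg A}{\neg A} \right\}, \{ \langle A \vee B, \{A\} \rangle \} \right) \qquad (4) \]

which has the same assertional extension. Note that without the support $\{A\}$ for
$A \vee B$, one obtains a second assertional extension with $\langle \neg A, \{\neg A\} \rangle$. This is what
happens in the previously-described default logics.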

It is well-known that cumulative and constrained extensions are equivalent whenever 
the underlying facts contain no support. Similar relationships are given among original 
and Q-default logic [5], justified and affirmative [10], rational and CA-default logic [5], 
respectively (cf. [ 10 ]). 



3 Correspondence with Constrained, Justified, and Rational 
Default Logic 

This section presents encodings for representing major variant default logics in Reiter’s
default logic. For a default theory $\Delta$, we produce a translated theory $\mathcal{T}_x\Delta$, such that
there is a 1-1 correspondence between the extensions of $\Delta$ in x-default logic and
(standard) extensions of $\mathcal{T}_x\Delta$. We begin with constrained and rational default logic,
whose encoding is less involved, then consider that of justified default logic.



3.1 Correspondence with Constrained Default Logic 

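For a language $\mathcal{L}$ over alphabet $\mathcal{P}$, let $\mathcal{L}'$ be the language over
$\mathcal{P}' = \{p' \mid p \in \mathcal{P}\}$. For a formula $\alpha$, let $\alpha'$ be the formula obtained by replacing
any symbol $p \in \mathcal{P}$ by $p'$; in addition define for a set W of formulas,
$W' = \{\alpha' \mid \alpha \in W\}$.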



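Definition 8. For default theory (D, W), define $\mathcal{T}_c(D, W) = (D_c, W_c)$ where

\[ W_c = W \cup W' \quad \text{and} \quad D_c = \left\{ \frac{\alpha : \beta' \wedge \gamma'}{\gamma \wedge (\beta' \wedge \gamma')} \;\middle|\; \frac{\alpha : \beta}{\gamma} \in D \right\} . \]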
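Informally, we retain the justification of an applied default rule in an extension, but as a
primed formula; this set of primed formulas then corresponds to the set C in Definition 3.
Thus we essentially encode Definition 3, but in a standard default theory. Other variants
of default logic are similarly encoded, although sometimes in a somewhat more complex
formulation. For our examples in (1) and (2), we obtain:

\[ \mathcal{T}_c(D_1, W_1) = \left( \left\{ \frac{: B' \wedge C'}{C \wedge B' \wedge C'}, \frac{: \neg B' \wedge D'}{D \wedge \neg B' \wedge D'} \right\}, \emptyset \right) ; \]

\[ \mathcal{T}_c(D_2, W_2) = \left( \left\{ \frac{: B' \wedge C'}{C \wedge B' \wedge C'}, \frac{: \neg C' \wedge D'}{D \wedge \neg C' \wedge D'} \right\}, \emptyset \right) . \]

Now, theory $\mathcal{T}_c(D_1, W_1)$ yields two extensions in standard default logic, one
containing $C \wedge B' \wedge C'$ and the other including $D \wedge \neg B' \wedge D'$. Analogously, we
obtain two extensions from $\mathcal{T}_c(D_2, W_2)$, one with $C \wedge B' \wedge C'$ and the other with
$D \wedge \neg C' \wedge D'$. In general, we have the following result.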

Theorem 1. For a default theory (D, W), we have that

1. if (E, C) is a constrained extension of (D, W) then $Th(E \cup C')$ is an extension of
   $\mathcal{T}_c(D, W)$;

2. if F is an extension of $\mathcal{T}_c(D, W)$ then
   $(F \cap \mathcal{L}, \{\varphi \mid \varphi' \in F \cap \mathcal{L}'\})$ is a constrained extension of (D, W).

Theorem 2. The constrained extensions of a default theory (D, W) and the extensions
of the translation $\mathcal{T}_c(D, W)$ are in a 1-1 correspondence.



3.2 Correspondence with Rational Default Logic 

As expected, the mapping of rational default logic into standard default logic is close to 
that of constrained default logic: 

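Definition 9. For default theory (D, W), define $\mathcal{T}_r(D, W) = (D_r, W_r)$ where

\[ W_r = W \cup W' \quad \text{and} \quad D_r = \left\{ \frac{\alpha : \beta'}{\gamma \wedge (\beta' \wedge \gamma')} \;\middle|\; \frac{\alpha : \beta}{\gamma} \in D \right\} . \]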



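As before, the consequent of rules in $D_r$ encodes the formulas in a rational extension
(Definition 4). For our examples in (1) and (2), we obtain:

\[ \mathcal{T}_r(D_1, W_1) = \left( \left\{ \frac{: B'}{C \wedge B' \wedge C'}, \frac{: \neg B'}{D \wedge \neg B' \wedge D'} \right\}, \emptyset \right) ; \]

\[ \mathcal{T}_r(D_2, W_2) = \left( \left\{ \frac{: B'}{C \wedge B' \wedge C'}, \frac{: \neg C'}{D \wedge \neg C' \wedge D'} \right\}, \emptyset \right) . \]

As with theory $\mathcal{T}_c(D_1, W_1)$, theory $\mathcal{T}_r(D_1, W_1)$ yields two extensions, containing
$C \wedge B' \wedge C'$ and $D \wedge \neg B' \wedge D'$, respectively. In contrast to $\mathcal{T}_c(D_2, W_2)$,
however, we obtain one extension from $\mathcal{T}_r(D_2, W_2)$, containing $C \wedge B' \wedge C'$.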

In general, we have the following result. 

Theorem 3. For a default theory (D, W), we have that

1. if (E, C) is a rational extension of (D, W) then $Th(E \cup C')$ is an extension of
   $\mathcal{T}_r(D, W)$;

2. if F is an extension of $\mathcal{T}_r(D, W)$ then
   $(F \cap \mathcal{L}, \{\varphi \mid \varphi' \in F \cap \mathcal{L}'\})$ is a rational extension of (D, W).

As with Theorem 2, one can show that the extensions of a default theory (D, W) and
the translation $\mathcal{T}_r(D, W)$ are in a 1-1 correspondence.
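
The two translations can be tried out on theories (1) and (2) with a short sketch. It again assumes, as a simplification outside the paper, that all formulas are conjunctions of literals represented as sets of strings, with a primed copy of a literal written by appending an apostrophe. The `translate` function and its `constrained` flag are hypothetical names; the extension search is a plain guess-and-check reading of Definition 1.

```python
from itertools import combinations

def neg(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def consistent(lits):
    return not any(neg(l) in lits for l in lits)

def gamma(defaults, facts, candidate):
    """Reiter's operator restricted to sets of literals."""
    out = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, just, cons in defaults:
            if pre <= out and consistent(candidate | just) and not cons <= out:
                out |= cons
                changed = True
    return frozenset(out)

def extensions(defaults, facts=frozenset()):
    pool = sorted(set(facts).union(*(cons for _, _, cons in defaults)))
    return [cand
            for k in range(len(pool) + 1)
            for cand in map(frozenset, combinations(pool, k))
            if consistent(cand) and gamma(defaults, facts, cand) == cand]

def prime(lits):
    """Primed copy of a set of literals, e.g. '-B' becomes "-B'"."""
    return frozenset(l + "'" for l in lits)

def translate(defaults, constrained):
    """Definition 8 if `constrained` is true, Definition 9 otherwise."""
    out = []
    for pre, just, cons in defaults:
        new_just = prime(just | cons) if constrained else prime(just)
        out.append((pre, new_just, cons | prime(just | cons)))
    return out

F = frozenset
D1 = [(F(), F({"B"}), F({"C"})), (F(), F({"-B"}), F({"D"}))]   # theory (1)
D2 = [(F(), F({"B"}), F({"C"})), (F(), F({"-C"}), F({"D"}))]   # theory (2)
tc1 = extensions(translate(D1, constrained=True))
tc2 = extensions(translate(D2, constrained=True))
tr1 = extensions(translate(D1, constrained=False))
tr2 = extensions(translate(D2, constrained=False))
```

Under these assumptions the constrained translation gives two extensions for both theories, while the rational translation gives two for theory (1) and a single one, containing C, B' and C', for theory (2). Dropping the primed literals from each result recovers the extension part of the corresponding constrained or rational extension.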



3.3 Correspondence with Justified Default Logic 

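Define for a language $\mathcal{L}$ over alphabet $\mathcal{P}$ and some set S, the family
$(\mathcal{L}^s)_{s \in S}$ of languages over $\mathcal{P}^s = \{p^s \mid p \in \mathcal{P}\}$ for $s \in S$. For $\alpha \in \mathcal{L}$ and
$s \in S$, let $\alpha^s$ be the formula obtained by replacing every symbol $p \in \mathcal{P}$ in $\alpha$ by
$p^s$; in addition define for a set W of formulas, $W^s = \{\alpha^s \mid \alpha \in W\}$.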

In what follows, we let the set of default rules D induce copies of the original 
language. 

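Definition 10. For default theory (D, W), define $\mathcal{T}_j(D, W) = (D_j, W_j)$ where

\[ W_j = W \cup \bigcup_{\delta \in D} W^\delta \quad \text{and} \quad D_j = \left\{ \frac{\alpha : (\beta^\delta \wedge \gamma^\delta) \wedge (\bigwedge_{\zeta \in D} \gamma^\zeta)}{\gamma \wedge (\beta^\delta \wedge \gamma^\delta) \wedge (\bigwedge_{\zeta \in D} \gamma^\zeta)} \;\middle|\; \delta = \frac{\alpha : \beta}{\gamma} \in D \right\} . \]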



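For simplicity, we write $\beta = Justif^{\circ}(\delta)$ whenever
$Justif(\delta) = (\beta^\delta \wedge \gamma^\delta) \wedge (\bigwedge_{\zeta \in D} \gamma^\zeta)$.
Abbreviating the two default rules in both examples, (1) and (2), by $\delta_1, \delta_2$ and
$\delta_3, \delta_4$, respectively, we get (after removing duplicates):

\[ \mathcal{T}_j(D_1, W_1) = \left( \left\{ \frac{: B^{\delta_1} \wedge C^{\delta_1} \wedge C^{\delta_2}}{C \wedge B^{\delta_1} \wedge C^{\delta_1} \wedge C^{\delta_2}}, \frac{: \neg B^{\delta_2} \wedge D^{\delta_2} \wedge D^{\delta_1}}{D \wedge \neg B^{\delta_2} \wedge D^{\delta_2} \wedge D^{\delta_1}} \right\}, \emptyset \right) ; \]

\[ \mathcal{T}_j(D_2, W_2) = \left( \left\{ \frac{: B^{\delta_3} \wedge C^{\delta_3} \wedge C^{\delta_4}}{C \wedge B^{\delta_3} \wedge C^{\delta_3} \wedge C^{\delta_4}}, \frac{: \neg C^{\delta_4} \wedge D^{\delta_4} \wedge D^{\delta_3}}{D \wedge \neg C^{\delta_4} \wedge D^{\delta_4} \wedge D^{\delta_3}} \right\}, \emptyset \right) . \]

In standard default logic, theory $\mathcal{T}_j(D_1, W_1)$ results in one extension containing C, D,
$B^{\delta_1}, C^{\delta_1}, D^{\delta_1}, \neg B^{\delta_2}, C^{\delta_2}, D^{\delta_2}$. As opposed to this, $\mathcal{T}_j(D_2, W_2)$ gives
two extensions, one with $C \wedge B^{\delta_3} \wedge C^{\delta_3} \wedge C^{\delta_4}$ and another including
$D \wedge \neg C^{\delta_4} \wedge D^{\delta_4} \wedge D^{\delta_3}$.
We have the following general result.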

Theorem 4. For a default theory {D, W), we have that 

1. if {E, J) is a justified extension of{D, W) then 

F = Th(^E U Ucjec U/Je extension ofTj{D, W); 

2. if F is an extension ofEj {D, W) then (FnC, J) is a justified extension of{D, W), 
where J = {P \ P = Justif°{5) and 5 G GD{7j{D, IF), F)}. 

GD{Tj{D, W),F) gives the set of default rules generating F; see the full version for a 
formal definition. 

In analogy to Theorem 2, one can show that the extensions of a default theory (D, W)
and the translation $\mathcal{T}_j(D, W)$ are in a 1-1 correspondence.
3.4 Correspondence with (Standard) Default Logic



We can show that there is a self-embedding for standard default logic to standard default 
logic, using the encoding of the previous subsection: 

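Definition 11. For default theory (D, W), define $\mathcal{T}_d(D, W) = (D_d, W_d)$ where

\[ W_d = W \cup \bigcup_{\delta \in D} W^\delta \quad \text{and} \quad D_d = \left\{ \frac{\alpha : \beta^\delta}{\gamma \wedge (\beta^\delta \wedge \gamma^\delta) \wedge (\bigwedge_{\zeta \in D} \gamma^\zeta)} \;\middle|\; \delta = \frac{\alpha : \beta}{\gamma} \in D \right\} . \]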



One can show that this mapping results in extensions that are in a 1-1 correspondence to 
those of the original theory. That is, one obtains a result similar to that in Theorem 4. This 
embedding also illustrates in a different fashion how default logic and justified default 
logic relate. As well, this translation allows for embedding standard default logic into 
rational default logic, as made precise next. 

Theorem 5. For a default theory (D, W), we have that 



1. if E is an extension of{D, W) then (F, F) is a rational extension ofTd{D, W), 

where F = Th(^E E< U 

2. if (F, F) is a rational extension ofTd{D, W) then FdCis an extension of{D,W). 

As before, one can show that the extensions of a default theory (D, W) and the translation
$\mathcal{T}_d(D, W)$ are in a 1-1 correspondence.

For our examples in (1) and (2), we get: 



rd{Di,Wi) 

Td{D2,W2) 



({ 

({ 



CaB*1aC''1aC-'2 j DA-.B*2/^£)62/^£)ai 




In contrast to the two rational extensions obtained from (Dj , IFj ), theory 72(0 ^ , Wi ) 
results in one rational extension containing C, D, and D^'^. 

As well, Td{F> 2 , IT 2 ) gives one rational extension with C A B^^ A A 

Note that a corresponding mapping into justified or constrained default logic is im- 
possible; this is not a matter of the specific translation but rather a principal impossibility. 



Theorem 6. There is no mapping $\mathcal{T}$ such that for any default theory (D, W), we have
that the extensions of (D, W) are in a 1-1 correspondence with the constrained/justified
extensions of $\mathcal{T}(D, W)$.

To see this, consider theory ( | > having no extension. On the other hand, it is well 

known that every default theory has at least one justified and constrained extension [11, 

3]. 

Finally, we note that a correspondence, as expressed in Theorem 5, can be established 
between justified and constrained extensions; we omit the details. 
4 Correspondence with Cumulative Default Logic

This section presents an encoding for representing cumulative default logic and cu- 
mulative extensions in default logic. In order to be able to talk about an assertion 
{a, {(3 \, . . . , fin}) within a (classical, logical) theory, an assertion is reified as an atomic 
formula where each argument is a reified formula that does not contain an 

instance of Thus (a, {/?i, . . . , is represented in the object language as 

{a, /3i A • • • A fin)"^^ ■ So that translated assertions have appropriate properties, we em- 
ploy a set of formulas Ax re axiomatising the reified formulas: 

Definition 12. Axre is the least set containing instances of the following schemata: 

1. If\- a then (a,0)’’® G Axre- 

2. (/3i=/32)D((a,/3i)- = (a,/32)"'=). 

5. A{aD D 

We have the following analogue of Definition 6 : 

Theorem 7. If{ai,Pi)^f {a 2 , /32)''" G P and {«i, ^ 2 } ^ 7 then F h ( 7 , /3i A fifi'^P 

From this we establish a correspondence between extensions of cumulative default 
logic and default logic. We first define correspondences between assertions and formulas 
of classical logic. 

Definition 13. 

For r a set of assertions, define Re {F) = | (a,P) G F}. 

For F a set of formulas of classical logic, define Re~^{F) = {{a,P) \ (a,/?)’’® G 

r}- 

For F a set of assertions, define Re~^{F) = Re (T) U Form{F) U Supp{F) U Axre- 



Definition 14. For assertional default theory {D, W), define Fa{D, W) = {Da, Wa) 
where 



Wa = i?e+(W) and 



Da = 



: /3A7 

(7,i/)A/3A7)^ A/3A7 



^ G A V'G/: 



The superscript re on formulas or sets of formulas indicates that these (sets of) formulas 
are in the image of our mapping, and are intended to be components satisfying a definition 
of a (Reiter) default extension. 

This translation nicely shows that the support of (reified) assertions is only needed
for keeping track of underlying assumptions when adding default conclusions to the set
of facts; the consistency check remains unaffected. In fact, the treatment of $\beta \wedge \gamma$ in
Definition 14 is identical to that of $\beta' \wedge \gamma'$ in Definition 8.

Consider our examples in (1) and (2): 



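$$T_a(D_3, W_3) = \left( \left\{ \frac{\langle \top, \psi \rangle^{re} : A}{\langle A, \psi \wedge A \rangle^{re} \wedge A},\ \frac{\langle A \vee B, \psi \rangle^{re} : \neg A}{\langle \neg A, \psi \wedge \neg A \rangle^{re} \wedge \neg A} \;\middle|\; \psi \in \mathcal{L} \right\},\ Ax_{re} \right)$$

$$T_a(D_4, W_4) = \left( \left\{ \frac{\langle \top, \psi \rangle^{re} : A}{\langle A, \psi \wedge A \rangle^{re} \wedge A},\ \frac{\langle A \vee B, \psi \rangle^{re} : \neg A}{\langle \neg A, \psi \wedge \neg A \rangle^{re} \wedge \neg A} \;\middle|\; \psi \in \mathcal{L} \right\},\ Ax_{re} \cup \{\langle A \vee B, \{A\} \rangle^{re}\} \cup \{A \vee B\} \cup \{A\} \right)$$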
Both translated theories yield one extension in standard default

logic, containing {A, 

We have the following general result. 

Theorem 8. For an assertional default theory $(D, W)$, we have that

1. if $\mathcal{E}$ is an assertional extension of $(D, W)$, then $Th(Re^{+}(\mathcal{E}))$ is an extension of
$T_a(D, W)$;

2. if $E$ is an extension of $T_a(D, W)$, then $Re^{-1}(E)$ is an assertional extension of
$(D, W)$.

Similar to the previous results, we also have a 1-1 correspondence between the extensions
of a default theory and the extensions of the translation.

The translation given here for cumulative default logic is different from the previous
translations, which clearly yielded a polynomial increase in size of the translated over the
original theory. In the present case, Definition 14 gives an infinite number of defaults (due
to the presence of $\psi$ in the formula schemata). However, in practice we can nonetheless
work with a translated theory that is only a polynomial increase over the original. First,
for an assertional extension $\mathcal{E}$ and its translated counterpart $Th(Re^{+}(\mathcal{E}))$, we clearly
have a 1-1 mapping between the respective sets of generating defaults. Second, any
instantiation of $\psi$ in Definition 14 (corresponding to the support of the prerequisite) can
only draw upon elements of $W$ or consequents of members of $D$; hence the size of any
translated rule will be bounded by $|W| \times |D|$. As a result, an intelligent default prover
can be restricted to a subset of the translated theory that is at worst a polynomial increase
in size over the original.

5 Concluding Remarks 

We have shown how variants of default logic can be expressed in Reiter’s original
approach. Similarly, we have shown that rational default logic and default logic may be
encoded, one into the other. This work then complements previous work in nonmonotonic
reasoning which has shown links between (seemingly) disparate approaches. Here we show
links between (seemingly) disparate variants of default logic. As well, the translations
clearly illustrate the relationships between alternative approaches to default logic. In
fact, there is a division between default logic and rational default logic on the one
hand, and the remaining variants on the other, manifesting itself through the property of
semi-monotonicity. Although it has often been informally argued that the computational
advantages* of semi-monotonicity are offset by a loss of representational power, this
claim has up to now not been formally sustained. The results reported in [9] provide
another indication of the relation between semi-monotonicity and expressiveness: normal
default logic is a semi-monotonic fragment of Reiter’s default logic and strictly less
expressive than default logic. The same can be stated about cumulativity, as
prerequisite-free, normal default logic (which corresponds to parallel circumscription) is
strictly less expressive than normal default logic.

* Semi-monotonicity allows for incremental constructions, also guaranteeing the existence of
extensions.
Our contributions can also be seen as a refinement of the investigations of complexity
and/or expressiveness conducted in [7,19,13,8,6,9]. From the perspective of complexity, 
there were of course hints that such mappings are possible. First, it is well-known that 
the reasoning problems of all considered variants are at the second level of the poly- 
nomial hierarchy [7,19]. The same is true for the “existence of extensions” problem in 
default logic and rational default logic, while it is trivial in justified and constrained 
default logic (and analogously for the respective assertional counterparts). In view of 
the same complexity of reasoning tasks, observe that our impossibility claim expressed 
in Theorem 6 is about the non-existence of corresponding sets of extensions. This does 
not exclude the possibility of an encoding of incoherent Reiter or rational default the- 
ories in a semi-monotonic variant that, for instance, indicates incoherence through a 
special-purpose symbol. However, there would be no 1-1 mapping here, since for any 
justified or constrained extension containing this special-purpose symbol, there would 
be no corresponding standard or rational extension. 

The most closely related work to our own is that of Tomi Janhunen [9], who has inves- 
tigated translations among specific subclasses of Reiter’s default logic. For instance, he 
gives a translation mapping arbitrary default theories into semi-normal theories, show- 
ing that semi-normal default theories are as expressive as general ones. Other translation 
schemes can be found in [13], where among others the notion of semi-representability is 
introduced. This concept deals with the representation of default theories within restricted 
subclasses of default theories over an extended language. Although semi-representability 
adheres to a fixed interpretation of default logic, one can view our results as semi- 
representation results among different interpretations of default theories. As regards 
future research, it would be interesting to see whether the results presented here lead to 
new relationships in the hierarchy of non-monotonic logics established in [9]. Also, a 
more detailed analysis of time and space complexity is an issue of future research. 

The present work may also, in fact, lend insight into computational characteristics of 
default logic. For example, our mappings provide specific syntactic characterisations of 
default theories that are guaranteed to have extensions. That is, for example, constrained 
default theories are guaranteed to have extensions; hence default theories appearing in 
the image of our mapping (Definition 8) are guaranteed to have extensions. 

Apart from the theoretical insights, the great advantage of mappings such as we have 
given, is that it suffices to have one general implementation of default logic for capturing 
a whole variety of different approaches. In this respect, our results allow us to handle all 
sorts of default logics by standard default logic implementations, such as DeReS [2]. 
Abstract. Inconsistencies inevitably arise in knowledge during practical reasoning.
In a logic-based approach, this gives rise to the need for consistency checking.
Unfortunately, this can be difficult. In classical propositional logic, this is in- 
tractable. However, there is a useful alternative to the notion of consistency called 
probable consistency. This offers a weakening of classical consistency checking 
where polynomial time tests are done on a set of formulae to determine the proba- 
bility that the set of formulae is consistent. In this paper, we present a framework 
for probable consistency checking for sets of clauses, and analyse some classes 
of polynomial time tests. 



1 Introduction 

In order to manage inconsistency in knowledge, we need to undertake consistency checking.
However, consistency checking is inherently intractable in the case of propositional
classical logic. To address this problem, we can consider using (A) tractable subsets of
classical logic (for example binary disjunctions of literals [3]), (B) heuristics to direct
the search for a model* (for example in semantic tableau [7], GSAT [10], and constraint
satisfaction [2]), (C) some form of knowledge compilation (for example [6,1]), and (D)
formalization of approximate consistency checking based on notions described below,
such as approximate entailment, and partial and probable consistency.

Approximate Entailment. Proposed in [5], and developed in [9,4], classical entailment
is approximated by two sequences of entailment relations. The first is sound but not
complete, and the second is complete but not sound. Both sequences converge to
classical entailment. For a set of propositional formulae $\Delta$, a formula $\alpha$, and an
approximate entailment relation $\models_i$, the decision of whether $\Delta \models_i \alpha$ holds or
$\Delta \not\models_i \alpha$ holds can be computed in polynomial time.

Partial Consistency. Consistency checking does not necessarily involve an exponential
search space. Furthermore, consistency checking for a set of formulae $\Delta$ can be
prematurely terminated when the search space exceeds some threshold. When the
checking of $\Delta$ is prematurely terminated, partial consistency is the degree to which
$\Delta$ is consistent. This can be measured in a number of ways including the proportion
of formulae from $\Delta$ that can be shown to form a consistent subset of $\Delta$. Maximum
generalized satisfiability [8] may be viewed as an example of this.

* Heuristic approaches can be either complete, such as semantic tableau, or incomplete, such as
the GSAT system. Whilst in general, using heuristics to direct search has the same worst-case
computational properties as undirected search, it can offer better performance in practice for
some classes of theories. Note that heuristic approaches do not tend to be oriented to offering any
analysis of theories beyond a decision on consistency.

Probable Consistency. Determining the probability that a set of formulae is consistent 
on the basis of polynomial time classifications of those formulae. Classifications 
for the propositional case can be based on tests including counting the number of 
different propositional letters, counting the multiple occurrences of each proposi- 
tional letter, and determining the degree of nesting for each logical symbol. The 
more a set of formulae is tested, the greater the confidence in the probability value 
for consistency/inconsistency, but this is at the cost of undertaking the tests. 

Identifying approximate consistency for a set of formulae $\Delta$ is obviously not a
guarantee that $\Delta$ is consistent. However, approximate consistency checking is useful
because it helps focus where problems possibly lie in $\Delta$, and so prioritize resolution
tasks. For example, if $\Delta$ and $\Gamma$ are two parts of a larger knowledgebase that is thought
to be inconsistent, and the probability of consistency is much greater for $\Delta$ than $\Gamma$, then
$\Gamma$ is more likely to be problematical and so should be examined more closely. Similarly,
if $\Delta$ and $\Gamma$ are two parts of a larger knowledgebase that is thought to be inconsistent,
and a partial consistency identified for $\Delta$ is greater than for $\Gamma$, then $\Gamma$ seems to contain
more problematical data and so should be examined more closely by the user.

The notions of probable consistency and partial consistency provide complementary
means for reasoning with knowledge. For a set of formulae $\Delta$, with partial consistency,
we may identify subsets of $\Delta$ that are consistent, but have no certainty on whether $\Delta$ is
consistent, whereas with probable consistency, we can obtain a view on the whole of $\Delta$,
but cannot guarantee to find consistent subsets even if they exist.

In this paper, we explore probable consistency checking in more detail. In the next 
section, we provide some basic definitions, then in the following sections we formalize 
probable consistency, and give an example in detail. 



2 Basic Definitions for Syntax 

In this section, we recall some of the usual definitions for classical logic, and then provide 
some additional definitions for analysing the syntax of formulae that will be used in the 
rest of the paper. 

Definition 1. Let $\mathcal{L}_{atoms}$ be a set of atoms and let $\mathcal{L}$ be the set of classical propositional
formulae formed from $\mathcal{L}_{atoms}$, and the $\wedge$, $\vee$, $\rightarrow$ and $\neg$ connectives. For each atom
$\alpha \in \mathcal{L}_{atoms}$, $\alpha$ is a positive literal and $\neg\alpha$ is a negative literal. Let $\mathcal{L}_{literals}$ be the set of
literals in $\mathcal{L}$.



Example 1. From the propositional atoms $\alpha$, $\beta$ and $\gamma$, members of $\mathcal{L}$ include $\alpha$, $\beta \wedge \gamma$,
$\neg\alpha \wedge \alpha$ and $(\alpha \wedge \beta) \rightarrow \neg\neg\gamma$.
Definition 2. For $\alpha_1 \vee .. \vee \alpha_n \in \mathcal{L}$, $\alpha_1 \vee .. \vee \alpha_n$ is a clause iff each of $\alpha_1, .., \alpha_n$ is a
literal. Let $\mathcal{C} \subseteq \mathcal{L}$ be the set of clauses. Let $\mathcal{C}_i \subseteq \mathcal{C}$ be the set of clauses of arity $i$ (i.e.
the clauses that are a disjunction of $i$ literals).

So, $\mathcal{C}_1$ is the set of literals, $\mathcal{C}_2$ is the set of binary clauses, and $\mathcal{C}_3$ is the set of ternary
clauses.

Definition 3. Let $\alpha$ be a literal, then $\alpha^*$ is the complementary literal. So if $\beta$ is an atom,
then $\beta^*$ is $\neg\beta$, and $(\neg\beta)^*$ is $\beta$.

Definition 4. For clauses $\alpha \vee \beta$, $\neg\beta \vee \gamma$, $\alpha \vee \gamma \in \mathcal{L}$, $\alpha \vee \gamma$ is a resolvent of $\alpha \vee \beta$ and
$\neg\beta \vee \gamma$.
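
Definitions 3 and 4 can be illustrated with a minimal sketch. The encoding of a literal as an (atom, sign) pair and the helper names `complement` and `resolvents` are assumptions made for this illustration only; they are not notation from the paper.

```python
from itertools import product

# Assumed encoding: a literal is (atom, positive); a clause is a tuple of literals.

def complement(lit):
    """Return the complementary literal (Definition 3)."""
    atom, positive = lit
    return (atom, not positive)

def resolvents(c1, c2):
    """Return every clause obtained by resolving one literal of c1 against c2."""
    out = set()
    for i, j in product(range(len(c1)), range(len(c2))):
        if c1[i] == complement(c2[j]):
            # Drop the resolved pair and keep the remaining disjuncts in order.
            out.add(c1[:i] + c1[i + 1:] + c2[:j] + c2[j + 1:])
    return out

alpha, beta, gamma = ("alpha", True), ("beta", True), ("gamma", True)
# alpha v beta and (not beta) v gamma resolve to alpha v gamma (Definition 4).
result = resolvents((alpha, beta), (complement(beta), gamma))
```

Here `result` contains the single clause `(alpha, gamma)`, matching the resolvent named in Definition 4.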

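Definition 5. Let the atoms function, denoted Atoms, be a function from $\wp(\mathcal{L})$ into
$\wp(\mathcal{L}_{atoms})$ such that Atoms($\Gamma$) gives the set of atoms used in $\Gamma$.

Example 2. For $\alpha, \beta, \gamma \in \mathcal{L}_{atoms}$, Atoms($\{\alpha \wedge \beta\}$) = $\{\alpha, \beta\}$, and
Atoms($\{\alpha \wedge \alpha, (\neg\alpha \wedge \beta \wedge \gamma) \rightarrow \alpha\}$) = $\{\alpha, \beta, \gamma\}$.

Definition 6. Let $\mathcal{A}$ be a finite subset of $\mathcal{L}_{atoms}$. Let $\mathcal{L}^{\mathcal{A}}$ be the subset of $\mathcal{L}$ where
$\mathcal{L}^{\mathcal{A}} = \{\alpha \in \mathcal{L} \mid \text{Atoms}(\{\alpha\}) \subseteq \mathcal{A}\}$

So, $\mathcal{L}^{\mathcal{A}}_{atoms} = \mathcal{A}$, and $\mathcal{L}^{\mathcal{A}}_{literals} = \mathcal{A} \cup \{\neg\alpha \mid \alpha \in \mathcal{A}\} = \mathcal{C}_1^{\mathcal{A}}$, and $\mathcal{C}_i^{\mathcal{A}} = \mathcal{C}_i \cap \mathcal{L}^{\mathcal{A}}$.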

3 Probable Consistency for Sets of Clauses 

Suppose a set of formulae $\Gamma$ is either $\{\alpha \vee \alpha, \neg\alpha \vee \neg\alpha\}$ or $\{\alpha \vee \neg\alpha, \beta \vee \gamma\}$, and it is
equally likely that it is either of them, then the probability of $\Gamma$ being consistent is 1/2.
Using this idea, if we can determine whether a given set of formulae $\Gamma$ is in a particular
class of sets of formulae where the probability of consistency is known, then we have a
probability of consistency for $\Gamma$. To do this, we need polynomial time tests to analyse
each set of formulae. These tests may include a count of the number of propositional
letters used in the formulae in the set, a count of the number of formulae in the set, and
a count of the number of types of logical symbols used in the formulae in the set. These
polynomial time tests delineate classes of sets of formulae. To support this, we need to
determine, in advance, the proportion of inconsistent formulae for these classes.

Example 3. Let $\Gamma \in \wp(\mathcal{L})$. If Atoms($\Gamma$) = $\{\alpha\}$, and $|\Gamma| = 2$, where one formula is a
positive literal and the other is a negative literal, then the sample space containing $\Gamma$ has
just one element, which is $\{\alpha, \neg\alpha\}$. So the probability that $\Gamma$ is inconsistent is 1.

Example 4. Let $\Gamma \in \wp(\mathcal{L})$. If Atoms($\Gamma$) = $\{\alpha, \beta\}$, and $|\Gamma| = 2$, where one is a positive
literal and the other is a negative literal, then the sample space containing $\Gamma$ has 2
elements $\{\alpha, \neg\beta\}$ and $\{\beta, \neg\alpha\}$. So the probability that $\Gamma$ is inconsistent is 0.
In this paper, we restrict consideration to sets of clauses. Informally, given some set
of clauses $\Gamma$, and the results of some tests on $\Gamma$, we want to determine the conditional
probability that $\Gamma$ is inconsistent. For this paper, we will assume a uniform distribution
over the sample space containing $\Gamma$. Assuming a uniform distribution may be too
restrictive for some applications. Alternatives include giving a weighted distribution that
is determined by either past usage, or predicted usage, of the formulae, or the
propositional letters used in the language. Using the uniform distribution, if $\Theta \subseteq \wp(\mathcal{C})$, and $m$
elements of $\Theta$ are inconsistent, and $|\Theta| = n$, then the probability of inconsistency of
a randomly selected member $\Gamma$ of $\Theta$ is $m/n$. Also, the probability of consistency of a
randomly selected member $\Gamma$ of $\Theta$ is $1 - m/n$.

Example 5. Consider $\Delta = \{\alpha, \neg\alpha, \beta, \neg\beta\}$. Here the probability of a randomly selected
element of $\wp(\Delta)$ being inconsistent is 7/16.
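
The figure in Example 5 can be checked by direct enumeration under the uniform distribution. The following is a minimal sketch, assuming literals are encoded as (atom, sign) pairs; the function names are hypothetical and chosen for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def consistent(literals):
    """A set of literals is consistent iff no atom occurs with both signs."""
    return not any((atom, not positive) in literals for atom, positive in literals)

def probability_inconsistent(delta):
    """m/n for a uniformly drawn element of the power set of delta."""
    subsets = [frozenset(c) for k in range(len(delta) + 1)
               for c in combinations(delta, k)]
    m = sum(1 for s in subsets if not consistent(s))
    return Fraction(m, len(subsets))

delta = [("alpha", True), ("alpha", False), ("beta", True), ("beta", False)]
p = probability_inconsistent(delta)          # 7 of the 16 subsets are inconsistent
q = probability_inconsistent(delta[:2])      # power set of {alpha, not alpha}
```

The value `p` is 7/16, and `q` is 1/4 for the smaller class built from a single atom.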



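Definition 7. Let $\Omega = \wp(\mathcal{C})$. Let $\Omega_{incon}$ be the class of inconsistent sets in $\Omega$ and let
$\Omega_{mis}$ be the class of minimal inconsistent sets in $\Omega$.

$\Omega_{incon} = \{\Gamma \in \Omega \mid \Gamma \vdash \bot\}$

$\Omega_{mis} = \{\Gamma \in \Omega_{incon} \mid \nexists\, \Gamma' \in \Omega_{incon} \text{ s.t. } \Gamma' \subset \Gamma\}$

Suppose $\Omega_t \subseteq \Omega$. So if by some test $t$, we show a set $\Gamma$ is a member of $\Omega_t$, then we
use the conditional probability statement $P(\Omega_{incon} \mid \Omega_t)$ to get a valuation of probable
consistency/inconsistency for $\Gamma$.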

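Example 6. Let $\Omega_{t_1} = \wp(\{\alpha, \neg\alpha, \beta, \neg\beta\})$. So $P(\Omega_{incon} \mid \Omega_{t_1}) = 7/16$. If we know
by test $t_1$, $\Gamma \in \Omega_{t_1}$, then according to this, the probability of inconsistency for $\Gamma$ is
7/16.

Example 7. Let $\Omega_{t_2} = \wp(\{\alpha, \neg\alpha\})$. So $P(\Omega_{incon} \mid \Omega_{t_2}) = 1/4$. If we know by test $t_2$,
$\Gamma \in \Omega_{t_2}$, then according to this, the probability of inconsistency for $\Gamma$ is 1/4.

Clearly, if $\Gamma \in \Omega_t$ and $P(\Omega_{incon} \mid \Omega_t) = 1$ then $\Gamma \vdash \bot$ holds. Similarly, if for all
$\Gamma \in \Omega_t$, $\Gamma \vdash \bot$ holds, then $P(\Omega_{incon} \mid \Omega_t) = 1$.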

Definition 8. Let $\Omega_{t_1} \subseteq \Omega$ and ... and $\Omega_{t_n} \subseteq \Omega$. A probable consistency system is a
set $\Pi$ of conditional probability statements defined as follows, where $p_1, .., p_n \in [0, 1]$.

$\Pi = \{P(\Omega_{incon} \mid \Omega_{t_1}) = p_1, .., P(\Omega_{incon} \mid \Omega_{t_n}) = p_n\}$.

The universe for $\Pi$ is $\Omega_{t_1} \cup ... \cup \Omega_{t_n}$.

In general, we may find by using a battery of test functions that a set of clauses
$\Gamma$ is a member of a number of delineated subsets of $\Omega$. If we denote these subsets
$\Omega_{t_1}, .., \Omega_{t_n}$, then this can be captured by a conditional probability of the form:

$P(\Omega_{incon} \mid \Omega_{t_1} \cap ... \cap \Omega_{t_n})$







A. Hunter 



Couching test functions in terms of observations means that we can undertake a 
series of tests on a set of clauses, and that the net result is independent of the order in 
which the tests are done. In other words, we can easily show that all sequences of a set 
of tests give the same result. 

For a probable consistency system $\Pi$, and a set of clauses $\Gamma$, the conditional
probability chosen is the conditional probability with the most specific reference class for the
test results.

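Definition 9. If $\Pi$ is a probable consistency system, then $\Pi_\Gamma$ and $\Pi_\Gamma^{min}$ are subsets
defined as follows, where $\Gamma \in \Omega$ and $\Omega_t \subseteq \Omega$.

$\Pi_\Gamma = \{P(\Omega_{incon} \mid \Omega_{t_i}) \in \Pi \mid \Gamma \in \Omega_{t_i}\}$

$\Pi_\Gamma^{min} = \{P(\Omega_{incon} \mid \Omega_{t_i}) \in \Pi_\Gamma \mid$
there is no $P(\Omega_{incon} \mid \Omega_{t_j}) \in \Pi_\Gamma$ such that $\Omega_{t_j} \subset \Omega_{t_i}\}$

This is used in the following classification, where $\tau$ is a fixed threshold such as 0.5.

If for all $P(\Omega_{incon} \mid \Omega_{t_i}) \in \Pi_\Gamma^{min}$, $P(\Omega_{incon} \mid \Omega_{t_i}) < \tau$ holds,
then $\Gamma$ is probably consistent according to $\Pi$.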

In the same way, for $\Gamma \in \Omega$, we can have analogous definitions for $\Gamma$ is probably
inconsistent according to $\Pi$, and for $\Gamma$ is probably minimally inconsistent according to
$\Pi$. The latter definition is potentially important in finding faults in knowledgebases and
in finding arguments from knowledgebases.
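
The selection of the most specific reference class in Definition 9 can be sketched as follows. This is an illustration under stated assumptions: each reference class is represented explicitly as a finite set of sets, a system is a list of (class, probability) pairs, and the names `most_specific` and `probably_consistent` are hypothetical. The strict comparison with the threshold follows the text as printed here.

```python
from fractions import Fraction
from itertools import combinations

def powerset(items):
    """All subsets of items, as a frozenset of frozensets."""
    return frozenset(frozenset(c) for k in range(len(items) + 1)
                     for c in combinations(items, k))

def most_specific(system, gamma):
    """Statements whose class contains gamma and has no applicable proper subclass."""
    applicable = [(cls, p) for cls, p in system if gamma in cls]
    return [(cls, p) for cls, p in applicable
            if not any(other < cls for other, _ in applicable)]

def probably_consistent(system, gamma, tau=Fraction(1, 2)):
    """True when every most specific statement gives a probability below tau."""
    return all(p < tau for _, p in most_specific(system, gamma))

a, na, b, nb = "a", "-a", "b", "-b"
system = [
    (powerset([a, na, b, nb]), Fraction(7, 16)),  # statement from Example 6
    (powerset([a, na]), Fraction(1, 4)),          # statement from Example 7
]
gamma = frozenset([a])
chosen = most_specific(system, gamma)
```

For this `gamma`, both classes apply, but only the smaller class from Example 7 is retained, so the probability used is 1/4. Note that in this sketch a set that falls in no class is vacuously classified as probably consistent; a practical system would need to treat that case separately.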

Whilst for small examples, the total cost of using test functions may be greater than 
undertaking a classical consistency check, in general, they can be cost-effective when 
used for larger sets of formulae and/or used for more expensive tasks such as searching 
for minimal inconsistent sets. 



4 Case Study with Binary Clauses 

Whilst consistency checking for sets of binary clauses is tractable, we will use them 
to illustrate the probable consistency checking approach. It is simple to check whether 
a set of formulae is a set of binary clauses. We can define a polynomial test in terms 
of a small finite state machine. Further simple tests can determine the number of atom 
symbols used and the number of clauses in a set. The focus of this case study is therefore 
on various classes of sets of binary clauses to determine the proportion of inconsistent 
sets within each class, and in particular, the probability that a set of binary clauses is a 
minimal inconsistent set. 

Obviously, if $\Gamma \in \wp(\mathcal{C}_2^{\mathcal{A}})$ is a singleton set, then the probability that $\Gamma$ is a minimal
inconsistent set is 0. For sets of binary clauses of cardinality greater than 1, we need to
consider the nature of minimal inconsistent subsets in a little more detail. In particular,
we need to consider the different ways that we can construct conflicting arguments from
sets of binary clauses.
4.1 Types of Inconsistency in Sets of Binary Clauses

First we consider the types of contradictory arguments that can be constructed from a 
set of binary clauses. 

Definition 10. A resolution sequence $\Gamma \in \wp(\mathcal{C}_2^{\mathcal{A}})$ is a sequence of clauses $(\phi_1, ..., \phi_n)$,
where $n > 1$, such that for each clause $\phi_i$ (except $\phi_1$ and $\phi_n$) in the sequence, one of the
disjuncts in $\phi_i$ resolves with one of the disjuncts in the immediate predecessor clause
$\phi_{i-1}$ in the sequence, and the other disjunct in $\phi_i$ resolves with one of the disjuncts
in the immediate successor clause $\phi_{i+1}$ in the sequence. The literal in $\phi_1$ that does not
resolve with a literal in $\phi_2$ is a tail. The literal in $\phi_n$ that does not resolve with a literal
in $\phi_{n-1}$ is also a tail.

A resolution sequence $(\phi_1, ..., \phi_n)$ gives a proof of a clause $\alpha \vee \beta$ where $\alpha$ is the tail
of $\phi_1$ and $\beta$ is the tail of $\phi_n$.

Definition 11. A chain $\Gamma$ is a resolution sequence with tails $\alpha$ and $\beta$ and there is no
subset $\Delta \subset \Gamma$ such that $\Delta$ is a resolution sequence with tails $\alpha$ and $\beta$.



Example 8. $(\delta \vee \alpha, \neg\delta \vee \gamma, \beta \vee \neg\gamma)$ is a chain.

A chain $(\phi_1, ..., \phi_n)$ gives a minimal proof of a clause $\alpha \vee \beta$ where $\alpha$ is the tail of
$\phi_1$ and $\beta$ is the tail of $\phi_n$.
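
The tails of a resolution sequence can be computed with a short sketch. It assumes binary clauses are ordered pairs of (atom, sign) literals and checks only the resolution-sequence condition of Definition 10 for the sequence in the order given; it does not test the minimality required of a chain. The function name `tails` is hypothetical.

```python
def complement(lit):
    atom, positive = lit
    return (atom, not positive)

def tails(sequence):
    """Return (first_tail, last_tail) if the binary clauses form a resolution
    sequence in the given order, otherwise None."""
    if len(sequence) < 2:
        return None
    for start in (0, 1):
        first_tail = sequence[0][start]
        carried = sequence[0][1 - start]   # literal that must resolve with the next clause
        ok = True
        for clause in sequence[1:]:
            if clause[0] == complement(carried):
                carried = clause[1]
            elif clause[1] == complement(carried):
                carried = clause[0]
            else:
                ok = False
                break
        if ok:
            # The literal left over in the last clause is the second tail.
            return (first_tail, carried)
    return None

alpha, beta = ("alpha", True), ("beta", True)
gamma, delta = ("gamma", True), ("delta", True)
example8 = [(delta, alpha), (complement(delta), gamma), (beta, complement(gamma))]
found = tails(example8)
```

For the sequence of Example 8, `found` is the pair of tails `(alpha, beta)`, so the sequence gives a proof of the clause with those two disjuncts.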

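Proposition 1. Let $\vdash$ be the classical consequence relation. For any $\Gamma \in \wp(\mathcal{C}_2^{\mathcal{A}})$,
$\phi \in \mathcal{C}_2^{\mathcal{A}}$, if $\Gamma \vdash \phi$, and $\Gamma \not\vdash \bot$, and there is no $\Gamma' \subset \Gamma$ such that $\Gamma' \vdash \phi$, then $\Gamma$ is a chain.

Clearly, $(\phi_1, ..., \phi_n)$ is a chain iff $(\phi_n, ..., \phi_1)$ is a chain.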

Definition 12. An isochain is a chain $(\phi_1, ..., \phi_n)$ where for $\phi_1$ and $\phi_n$, there is a
disjunct in common. This disjunct is called the head of the isochain. The other disjunct
in $\phi_1$ resolves with one of the disjuncts in $\phi_2$. The other disjunct in $\phi_n$ resolves with one
of the disjuncts in $\phi_{n-1}$.

An isochain gives a minimal proof of a literal. So for each isochain $(\phi_1, .., \phi_n)$ there
is a literal $\alpha$ such that $\{\phi_1, .., \phi_n\} \vdash \alpha$, where $\alpha$ is the head of the isochain.

Example 9. The following are two examples of an isochain. Both have $\alpha$ as head.

$(\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \delta, \neg\delta \vee \alpha)$

$(\beta \vee \alpha, \neg\beta \vee \gamma, \alpha \vee \neg\gamma)$

Example 10. An isochain can use one or more of the same clauses at the start and end
of the chain. In the following the first two clauses and the last two clauses are the same.

$(\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \delta, \neg\delta \vee \neg\gamma, \neg\beta \vee \gamma, \alpha \vee \beta)$
Definition 13. A superchain is a chain $(\phi_1, ..., \phi_n)$ such that for some $i$, $(\phi_1, ..., \phi_i)$
is an isochain and $(\phi_{i+1}, ..., \phi_n)$ is a chain and so $\phi_i$ has a disjunct that resolves with
$\phi_{i+1}$.

The isochain gives a minimal proof of a literal $\beta$, and the chain gives a minimal proof
of a binary clause $\neg\beta \vee \alpha$, and so the superchain gives a minimal proof of $\alpha$.

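Example 11. The following are two examples of a superchain.

$(\beta \vee \beta, \alpha \vee \neg\beta)$

$(\neg\delta \vee \neg\delta, \delta \vee \neg\gamma, \neg\beta \vee \gamma, \beta \vee \alpha)$

Example 12. A superchain can use one or more of the same clauses in the isochain and
chain parts of the sequence. Consider the superchain

$(\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \alpha, \neg\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \delta)$

which is composed from the isochain $(\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \alpha)$ and the chain
$(\neg\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \delta)$, which have the chain $(\neg\beta \vee \gamma)$ in common.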

Proposition 2. Let $(\phi_1, ..., \phi_n)$ be a superchain. If $(\phi_1, ..., \phi_i)$ is an isochain, then
$(\phi_{i+1}, ..., \phi_n)$ is not an isochain.

Proposition 3. Let $(\phi_1, ..., \phi_n)$ be a chain. If $(\phi_1, ..., \phi_n)$ is an isochain, then
$(\phi_1, ..., \phi_n)$ is not a superchain.

Proof: Let $(\phi_1, ..., \phi_n)$ be an isochain with head $\alpha$. So $\alpha$ is a tail in $\phi_1$ and $\alpha$ is a tail
in $\phi_n$. Now suppose $(\phi_1, ..., \phi_n)$ is also a superchain. Then this superchain incorporates
an isochain with head $\alpha$. So, there is a subset of $\{\phi_1, .., \phi_n\}$ that has the same tails
as $(\phi_1, .., \phi_n)$. Therefore, $(\phi_1, ..., \phi_n)$ is not a chain. This contradiction means that it
cannot be a superchain. □

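Definition 14. For any $\Sigma \in \wp(\mathcal{C}_2^{\mathcal{A}})$, $\Sigma$ is an argument for the literal $\alpha$ iff $\Sigma \vdash \alpha$, and
there is no $\Sigma' \subset \Sigma$ such that $\Sigma' \vdash \alpha$, and $\Sigma$ is an isochain or a superchain.

Definition 15. Let $\alpha$ be a literal. A chainconflict is a pair of chains

$((\phi_1, ..., \phi_n), (\psi_1, ..., \psi_m))$

such that $(\phi_1, ..., \phi_n)$ is an argument for $\alpha$ and $(\psi_1, ..., \psi_m)$ is an argument for $\alpha^*$.

Proposition 4. For any $\Delta \in \wp(\mathcal{C}_2^{\mathcal{A}})$, if $\Delta$ is a minimal inconsistent set then there is a
chainconflict $((\phi_1, .., \phi_n), (\psi_1, .., \psi_m))$ such that $\Delta = \{\phi_1, .., \phi_n\} \cup \{\psi_1, .., \psi_m\}$.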



The converse does not hold as illustrated below. 
Example 13. Let $(\gamma \vee \gamma, \neg\gamma \vee \alpha)$ be an argument for $\alpha$, and let $(\neg\gamma \vee \neg\gamma, \gamma \vee \neg\alpha)$ be
an argument for $\neg\alpha$. But there is a subset of $\{\gamma \vee \gamma, \neg\gamma \vee \alpha\} \cup \{\neg\gamma \vee \neg\gamma, \gamma \vee \neg\alpha\}$ that
is inconsistent.
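
The failure of minimality in Example 13 can be confirmed by a brute-force check. This is a minimal sketch, assuming clauses are tuples of (atom, sign) literals and using a truth-table test; the function names are hypothetical, and the method is exponential in the number of atoms, so it is suitable only for small illustrations.

```python
from itertools import combinations, product

def satisfiable(clauses):
    """Truth-table check over the atoms that occur in the clauses."""
    atoms = sorted({atom for clause in clauses for atom, _ in clause})
    for values in product([True, False], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(any(model[a] == pos for a, pos in clause) for clause in clauses):
            return True
    return False

def minimal_inconsistent(clauses):
    """Inconsistent, and consistent after removing any single clause."""
    clauses = list(clauses)
    if satisfiable(clauses):
        return False
    return all(satisfiable(sub) for sub in combinations(clauses, len(clauses) - 1))

g, ng = ("gamma", True), ("gamma", False)
a, na = ("alpha", True), ("alpha", False)
union = [(g, g), (ng, a), (ng, ng), (g, na)]   # the union from Example 13
core = [(g, g), (ng, ng)]                      # its inconsistent subset
```

Checking only the subsets with one clause removed is sufficient, because every subset of a satisfiable set is itself satisfiable. Here `union` is inconsistent but not minimal, while `core` is a minimal inconsistent set.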

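Definition 16. Let $((\phi_1, ..., \phi_n), (\psi_1, ..., \psi_m))$ be a chainconflict. It is a disjoint
chainconflict, if $\{\phi_1, ..., \phi_n\} \cap \{\psi_1, ..., \psi_m\} = \emptyset$, otherwise it is a joint chainconflict.

Example 14. The following is a disjoint chainconflict with the first item being an
argument for $\alpha$ and the second being an argument for $\neg\alpha$.

$((\alpha \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \alpha), (\delta \vee \delta, \neg\delta \vee \neg\alpha))$

Example 15. The following is a joint chainconflict, with the first item being an argument
for $\alpha$ and the second being an argument for $\neg\alpha$, and $(\gamma \vee \beta, \neg\beta \vee \gamma)$ is a subsequence
in common with both superchains.

$((\gamma \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \delta, \neg\delta \vee \alpha), (\gamma \vee \beta, \neg\beta \vee \gamma, \neg\gamma \vee \neg\epsilon, \epsilon \vee \neg\alpha))$

Example 16. The pair $((\alpha \vee \beta, \neg\beta \vee \neg\delta, \delta \vee \alpha), (\neg\alpha \vee \beta, \neg\beta \vee \neg\delta, \delta \vee \neg\alpha))$ is a joint
chainconflict. So we have the isochain $(\alpha \vee \beta, \neg\beta \vee \neg\delta, \delta \vee \alpha)$ and the isochain
$(\neg\alpha \vee \beta, \neg\beta \vee \neg\delta, \delta \vee \neg\alpha)$. Together they conflict on $\alpha$ with $(\neg\beta \vee \neg\delta)$ being the chain in
common.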

We will use this characterization of isochains and superchains in joint and disjoint 
chainconflicts to enumerate the possible minimal inconsistent sets of binary clauses. 

4.2 Combinatorics of Sets of Binary Clauses 

We now consider the combinatorics for constructing chainconflicts, and thereby gain 
the proportion of sets of binary clauses that are minimal inconsistent sets. Our approach 
here is to consider the formats for minimal inconsistent sets of binary clauses of various 
cardinalities. 

Definition 17. A clause scheme is one of the following, where $i, j \in \mathbb{N}$, and $X_i$ and $X_j$
are meta-variable symbols (place holders for literals), and $X_i^*$ (respectively $X_j^*$) has
to be instantiated with the complement of $X_i$ (respectively $X_j$).

$X_i \vee X_j$    $X_i \vee X_j^*$    $X_i^* \vee X_j$    $X_i^* \vee X_j^*$

Definition 18. A grounding is an assignment of a literal to a meta-variable where
the meta-variable is given on the left-hand side of the = symbol and the literal is on
the right-hand side. A grounding set is a set of groundings where each meta-variable
occurs at most once, and each meta-variable is ground with a different atom symbol
in the literal (i.e. for all grounding sets if $X_i = \alpha_i$ and $X_j = \alpha_j$ and $i \neq j$, then
Atoms($\{\alpha_i\}$) $\neq$ Atoms($\{\alpha_j\}$)).
Definition 19. Let $\Phi$ be a clause scheme, and $G$ be a grounding. The Instantiate($\Phi$, $G$)
function uses the groundings in $G$ to instantiate the meta-variables in $\Phi$. The result is
an instantiation of $\Phi$.

When a set of clause schema is instantiated, we obtain a clause from each clause 
scheme. 

Example 17. Let $\{X_1 \vee X_2, X_2^* \vee X_3, X_2 \vee X_4^*\}$ be a set of clause schema, and let
$\{X_1 = \neg\alpha, X_2 = \neg\beta, X_3 = \delta, X_4 = \gamma\}$ be a grounding set. Then we obtain the set of
clauses $\{\neg\alpha \vee \neg\beta, \beta \vee \delta, \neg\beta \vee \neg\gamma\}$ which is an instantiation.

Clause schema are used to define a set of chainconflicts.

Definition 20. A conflict scheme is a set of clause schema $\{\Phi_1, .., \Phi_n\}$, such that
if there is a grounding set $G$ that can instantiate each of these clause schema, then
Instantiate($\Phi_1$, $G$) $\cup .. \cup$ Instantiate($\Phi_n$, $G$) is a minimal inconsistent set.

Example 18. The set $\{X_1 \vee X_1, X_1^* \vee X_1^*\}$ is a conflict scheme. If we let $X_1 = \alpha$,
where $\alpha \in \mathcal{A}$, then we obtain $\{\alpha \vee \alpha, \neg\alpha \vee \neg\alpha\}$, which is a minimal inconsistent set.

We also require some subsidiary definitions.

Definition 21. Let $\Phi$ be a set of clause schema. The function Letters($\Phi$) gives the number
of different meta-variable symbols used in $\Phi$.

Example 19. Let $\Phi = \{X_1 \vee X_2, X_2^* \vee X_3, X_2 \vee X_4^*\}$. Then Letters($\Phi$) = 4.

Definition 22. Let $\Phi$ be a set of clause schema. The function Heteroclauses($\Phi$) gives
the number of schema in $\Phi$ where the first disjunct has a different meta-variable symbol
to the second disjunct.

Example 20. Let $\Phi = \{X_1 \vee X_2, X_2^* \vee X_3, X_2 \vee X_2^*\}$. Then Heteroclauses($\Phi$) = 2.
Here the first and second disjuncts of $X_2 \vee X_2^*$ have the same meta-variable symbol.
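
The notions of instantiation, Letters and Heteroclauses can be sketched in a few lines. The encoding is an assumption made for illustration: a disjunct of a clause scheme is an (index, starred) pair, a grounding set is a dictionary from indices to (atom, sign) literals, and the lowercase function names are hypothetical counterparts of the functions defined above.

```python
def complement(lit):
    atom, positive = lit
    return (atom, not positive)

def instantiate(scheme, grounding):
    """Replace each meta-variable by its literal, complementing starred ones."""
    return tuple(complement(grounding[i]) if starred else grounding[i]
                 for i, starred in scheme)

def letters(schema):
    """Number of different meta-variable symbols used in the schema."""
    return len({i for scheme in schema for i, _ in scheme})

def heteroclauses(schema):
    """Number of schema whose two disjuncts use different meta-variables."""
    return sum(1 for (i, _), (j, _) in schema if i != j)

# The set of clause schema from Example 17: X1 v X2, X2* v X3, X2 v X4*
schema = [((1, False), (2, False)), ((2, True), (3, False)), ((2, False), (4, True))]
grounding = {1: ("alpha", False), 2: ("beta", False), 3: ("delta", True), 4: ("gamma", True)}
clauses = [instantiate(s, grounding) for s in schema]
```

With the grounding set of Example 17, `clauses` holds the three instantiated clauses listed there, and `letters(schema)` is 4 as in Example 19.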

Some conflict schema are symmetrical in the sense that there are $n$ different grounding
sets for each instantiation. For instance, for $\{X_1 \vee X_1, X_1^* \vee X_1^*\}$, there are two
groundings for each instantiation. So for the instantiation $\{\alpha \vee \alpha, \neg\alpha \vee \neg\alpha\}$, we
have $G_1 = \{X_1 = \alpha\}$ and $G_2 = \{X_1 = \neg\alpha\}$.

Definition 23. Let $\Phi$ be a set of clause schema. The function Symmetry is defined as
follows: If there are $n$ different grounding sets $G_1, .., G_n$ such that Instantiate($\Phi$, $G_1$) =
.. = Instantiate($\Phi$, $G_n$), for each instantiation of $\Phi$, then Symmetry($\Phi$) = $1/n$.

Example 21. To illustrate the Symmetry function, consider the following:

Symmetry($\{X_1 \vee X_2, X_2^* \vee X_3, X_2 \vee X_4^*\}$) = 1
Symmetry({Wi* V ATi, V Wi*}) = 1/2 
Symmetry({Xi V W2, V Xi, X/ V X2, X^ V X/}) = 1/2 
Now we consider how we get the number of chainconflicts from a conflict scheme
and a set of atoms.

Definition 24. For a conflict scheme $\Phi$, the number of chainconflicts that can be
obtained with a set of atoms $\mathcal{A}$ is determined by the function Conflictcount($\Phi$, $\mathcal{A}$). We
assume a chainconflict $((\phi_1, .., \phi_n), (\psi_1, .., \psi_m))$ is the same as $((\psi_1, .., \psi_m), (\phi_1, .., \phi_n))$
and so would count them only once.



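Proposition 5. Let $f$ = Heteroclauses($\Phi$) $- 1$, if Heteroclauses($\Phi$) $> 0$, and $f = 0$
otherwise, $g$ = Symmetry($\Phi$), $h$ = Letters($\Phi$), and $a = |\mathcal{A}|$. Hence

$$\text{Conflictcount}(\Phi, \mathcal{A}) = \frac{a!}{(a-h)!} \times 2^h \times 2^f \times g \quad \text{if } a \geq h$$

$$\text{Conflictcount}(\Phi, \mathcal{A}) = 0 \quad \text{if } a < h$$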



Proof: For $a < h$, there are insufficient atom symbols in $\mathcal{A}$ to form a grounding
set, and so there are 0 instantiations. For $a \geq h$, we need to justify the four terms as
follows. The first term gives all permutations of the $a$ atom symbols in $\mathcal{A}$ used to
instantiate the sequence of $h$ atomic meta-variable symbols in $\Phi$. The second term gives
the number of choices of positive and negative literals for the sequence of $h$ atomic
meta-variable symbols in $\Phi$. The third term gives the permutations for each disjunct in
each clause being the first or second disjunct in the heterogeneous clauses except for the
first heterogeneous clause. The fourth term eliminates the duplicate counts obtained in
the second and third terms in cases of symmetry. □
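
The four terms justified in the proof can be evaluated with a small sketch. It assumes that the values $f$, $g$, $h$ and $a$ are supplied directly, as they are in the worked examples; the function name `conflictcount` and its argument order are choices made for this illustration.

```python
from fractions import Fraction
from math import perm

def conflictcount(f, g, h, a):
    """Evaluate a!/(a-h)! * 2^h * 2^f * g, or 0 when a < h (Proposition 5)."""
    if a < h:
        return Fraction(0)
    # perm(a, h) is a!/(a-h)!: ordered choices of h atoms out of a.
    return perm(a, h) * 2 ** h * 2 ** f * Fraction(g)

# Values used in Example 22 (scheme 2a with five atoms).
count_2a = conflictcount(f=0, g=Fraction(1, 2), h=1, a=5)

# Values used in Example 23 (schemes 3a, 3b and 3c with two atoms).
counts_3 = [conflictcount(1, 1, 2, 2), conflictcount(0, 1, 2, 2), conflictcount(1, 1, 2, 2)]
total_3 = sum(counts_3)
```

With these inputs `count_2a` is 5 and the three degree-3 counts are 16, 8 and 16, giving a total of 40, in agreement with Examples 22 and 23.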



Definition 25. A conflict profile, denoted $\Psi$, of degree $k$ is a set of conflict schema
where for all $\Phi \in \Psi$, $|\Phi| = k$, and for each minimal inconsistent set $\Gamma \in \wp(\mathcal{C}_2^{\mathcal{A}})$, if
$|\Gamma| = k$, then there is exactly one conflict scheme $\Phi' \in \Psi$ such that an instantiation of
$\Phi'$ is $\Gamma$.



Conflict profiles for degrees 2 to 4, are given in Tables 1 to 3. 

Proposition 6. The number of minimal inconsistent sets of cardinality $k$ in $\wp(\mathcal{C}_2^{\mathcal{A}})$ is
calculated as follows, where $\Psi = \{\Phi_1, ..., \Phi_n\}$ is a conflict profile of degree $k$:

$$\sum_{i=1}^{n} \text{Conflictcount}(\Phi_i, \mathcal{A})$$

Proof: Each chainconflict corresponds to a minimal inconsistent set. According to 
Definition 25, the set of minimal inconsistent sets generated by each conflict scheme 
is disjoint from those generated by the other conflict schema. Therefore, the number of 
minimal inconsistent sets of cardinality k is the sum of the minimal inconsistent sets 
generated by each conflict scheme. □ 

Example 22. Let $\Phi_{2a}$ be defined by 2a in Table 1. Let $|\mathcal{A}| = 5$. Since $a = 5$, $f = 0$,
$g = 1/2$, and $h = 1$, we have the following, and hence the number of minimal inconsistent
sets of cardinality 2 in $\wp(\mathcal{C}_2^{\mathcal{A}})$ is 5.

$$\text{Conflictcount}(\Phi_{2a}, \mathcal{A}) = \frac{5!}{(5-1)!} \times 2^1 \times 2^0 \times 1/2 = 5$$
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Example 23. Let ^ 3 a, be defined by 3a, ..,3c in Table 2. Let |^| = 2. The number 
of minimal inconsistent sets of cardinality 3 in p(C^) is 40, as follows. 

For <p 3 a, / = 1) 5 = Ij 2 nd ft, = 2, so Conflictcount(^ 3 o, = 16 

For ^ 36 , f = 0, g = 1, and ft = 2, so Conflictcount(<ft 3 t,, ^) = 8 

For <p 3 c, / = Ij ft = 1) and ft = 2, so Conflictcount(^ 3 c, .4) = 16 
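The arithmetic in Examples 22 and 23 can be checked mechanically. The following is a minimal sketch, assuming the four-term product stated in the proof above; the function name `conflict_count` and the dictionary layout are illustrative choices, not part of the paper.

```python
from fractions import Fraction
from math import factorial


def conflict_count(a, h, f, g):
    """Count instantiations of one conflict scheme over `a` atom symbols.

    h: number of atomic meta-variable symbols in the scheme
    f, g: scheme-specific exponent and symmetry factor, as used in the text
    """
    if a < h:
        return Fraction(0)  # not enough atom symbols to form a grounding set
    permutations = factorial(a) // factorial(a - h)  # a! / (a - h)!
    return permutations * 2**h * 2**f * Fraction(g)


# Example 22: scheme 2a with |A| = 5
count_2a = conflict_count(a=5, h=1, f=0, g=Fraction(1, 2))

# Example 23: schemes 3a-3c with |A| = 2; (h, f, g) values are those quoted in the example
schemes_3 = {"3a": (2, 1, 1), "3b": (2, 0, 1), "3c": (2, 1, 1)}
counts_3 = {name: conflict_count(2, h, f, g) for name, (h, f, g) in schemes_3.items()}
total_3 = sum(counts_3.values())

print(count_2a, counts_3, total_3)
```

Exact rational arithmetic is used so that the symmetry factor $g = 1/2$ does not introduce rounding.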

It is straightforward to obtain conditional probabilities of inconsistency using the 
count of minimal inconsistent sets. 

Proposition 7. Let $a = |\mathcal{A}|$. The number of sets of cardinality $k$ in
$\wp(C^{2}_{\mathcal{A}})$ is the following number of combinations of sets of size $k$, where
$n = (2a)^{2}$ is the number of binary clauses that can be formed from the $2a$ literals
obtained from $\mathcal{A}$.

\[
\frac{n!}{k!\,(n-k)!}
\]



Proposition 8. The probability that any set of cardinality $k$ is a minimal inconsistent
set is given by the following ratio, where the numerator is obtained from Proposition 6 and
the denominator is obtained from Proposition 7.

\[
\frac{\text{number of minimal inconsistent sets of cardinality } k \text{ in } \wp(C^{2}_{\mathcal{A}})}
     {\text{number of sets of cardinality } k \text{ in } \wp(C^{2}_{\mathcal{A}})}
\]
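As an illustration of the ratio, the sketch below combines the count from Example 23 (40 minimal inconsistent sets of cardinality 3 when there are 2 atoms) with the binomial coefficient of Proposition 7. The helper names are hypothetical, and the value 40 is taken from the example rather than recomputed.

```python
from fractions import Fraction
from math import comb


def num_clause_sets(a, k):
    """Number of k-element sets of binary clauses over `a` atoms (Proposition 7)."""
    n = (2 * a) ** 2  # binary clauses formed from the 2a literals
    return comb(n, k)


def prob_minimal_inconsistent(num_min_inconsistent, a, k):
    """Ratio of Proposition 8, returned as an exact fraction."""
    return Fraction(num_min_inconsistent, num_clause_sets(a, k))


# Two atoms, sets of three clauses, 40 minimal inconsistent sets (Example 23)
p = prob_minimal_inconsistent(40, a=2, k=3)
print(num_clause_sets(2, 3), p)
```

With $a = 2$ there are $n = 16$ binary clauses, so the denominator is $\binom{16}{3} = 560$ and the ratio reduces to $1/14$.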

This case study is intended to indicate the promise of probable consistency checking.
In order to count larger minimal inconsistent sets in $\wp(C^{2}_{\mathcal{A}})$, we need to
generate conflict profiles of degree greater than 4. In order to render this viable, we are
currently exploring the development of an algorithm to generate conflict profiles. In
parallel, we are seeking more general results that would enable the counting of minimal
inconsistent sets in $\wp(C^{2}_{\mathcal{A}})$, and in $\wp(C^{i}_{\mathcal{A}})$ for $i > 2$,
more directly.

5 Discussion 

Consistency checking is increasingly important in artificial intelligence, data and knowl- 
edge engineering, and software engineering. In approaches to knowledge representation 
and reasoning such as truth maintenance systems, argumentation systems based on iden- 
tifying consistent subsets, and default reasoning systems, consistency checking is an in- 
tegral part of inferencing. Probable consistency checking potentially offers more efficient 
inferencing, either by directing conflict resolution to the more problematical areas of the 
data, or by allowing inferences before eliminating all possibilities of an inconsistency. 

More generally, probable consistency checking is potentially important in informa- 
tion integration, requirements engineering, negotiation, and multi-agent interaction. It 
seems that the real advantage with probable consistency checking in all these activities 
is the relative ordering we can obtain rather than the absolute probability values. The 
relative ordering helps prioritize search or further analysis.

Table 1. A conflict profile of degree 2 is composed of conflict scheme 2a. The first line is
for the first argument and the second line is for the second argument

        Conflict scheme                                Type
  2a    $X_1 \vee X_1$                                 disjoint isochains
        $X_1^* \vee X_1^*$

Table 2. A conflict profile of degree 3 is composed of conflict schema 3a-3c

        Conflict scheme                                Type
  3a    $X_1 \vee X_2,\ X_2^* \vee X_1$                disjoint isochains
        $X_1^* \vee X_1^*$
  3b    $X_1 \vee X_1,\ X_1^* \vee X_2$                disjoint isochain-1-superchain
        $X_2^* \vee X_2^*$
  3c    $X_1 \vee X_1,\ X_1^* \vee X_2$                joint superchains
        $X_1 \vee X_1,\ X_1^* \vee X_2^*$


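Table 3. A conflict profile of degree 4 is composed of conflict schema 4a-4f

        Conflict scheme                                        Type
  4a    $X_1 \vee X_2,\ X_2^* \vee X_3,\ X_3^* \vee X_1$       disjoint isochains
        $X_1^* \vee X_1^*$
  4b    $X_1 \vee X_2,\ X_2^* \vee X_1,\ X_1^* \vee X_3$       disjoint isochain-1-superchain
        $X_3^* \vee X_3^*$
  4c    $X_1 \vee X_2,\ X_2^* \vee X_1$                        disjoint isochains
        $X_1^* \vee X_3^*,\ X_3 \vee X_1^*$
  4d    $X_1 \vee X_2,\ X_2^* \vee X_1$                        disjoint isochains
        $X_1^* \vee X_2^*,\ X_2 \vee X_1^*$
  4e    $X_1 \vee X_1,\ X_1^* \vee X_2$                        disjoint superchains
        $X_3 \vee X_3,\ X_3^* \vee X_2^*$
  4f    $X_1 \vee X_1,\ X_1^* \vee X_2,\ X_2^* \vee X_3$       joint superchains
        $X_1 \vee X_1,\ X_1^* \vee X_3^*$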


Of course, none of these probabilities take into account psychological or cognitive 
factors in developing or handling knowledgebases. For example, psychological obser- 
vations may reveal an increased error resulting from an increased complexity of a spec- 
ification. If such an observation were sufficiently precise, then perhaps there should be 
some weighting applied to the probability of inconsistency of formulae so that larger 
formulae are more likely a priori to be inconsistent.

Abstract. While AGM belief revision identifies belief states with sets of
formulas, proposals for iterated revision are usually based on more com- 
plex belief states. In this paper we investigate within the AGM framework 
several postulates embodying some aspects of iterated revision. Our main 
results are negative: when added to the AGM postulates, our postulates 
force revision to be maxichoice (whenever the new piece of information is 
inconsistent with the current beliefs the resulting belief set is maximal). 
We also compare our results to revision operators with memory and we 
investigate some postulates proposed in this framework. 



1 Introduction 

While AGM belief revision identifies belief states with sets of formulas, proposals
for iterated revision are usually based on more complex belief states. Following
the work of [7], they are usually represented by total pre-orders on interpretations.
In fact in [6], Darwiche and Pearl first stated their postulates (C1-C4) in
the classical AGM framework. But it has been shown in [8,15] that (C2) is inconsistent
with AGM, and that under the AGM postulates (C1) implies (C3) and
(C4). To remove these contradictions, Darwiche and Pearl rephrased their postulates and
the AGM postulates in terms of epistemic states [7]. This has led to a widely
accepted framework for iterated revision, and most of the work on iterated belief
revision now uses this more complex framework.

So an interesting question investigated in this paper is which requirements
on iteration one can consistently add to the usual AGM framework. We focus
on the status of old information, and formulate several postulates embodying
that aspect of iterated revision. They all express that old information about $A$
determines in some way the current status of $A$.

In particular, the first postulate says that if the agent was informed about $A$
before revision (in the sense that either $A$ or $\neg A$ was accepted) then the agent
should remain informed about $A$ after revision.

Our second postulate is motivated by the following basic algorithm for the
revision of a belief set $B$ by a new piece of information $A$ [11,19]: first put $A$
in the new belief set, then add as many old beliefs from $B$ as possible. So the
second postulate expresses that the corresponding operator is idempotent with
respect to $B$. We also study a family of postulates that generalizes this idea.

We also review other postulates coming from the iterated revision literature, 
in the classical belief set framework.

Our results are mainly negative: when added to the AGM postulates, our
postulates lead to extreme revision operators. In particular the first two postu- 
lates force revision to be maxichoice: whenever the new piece of information is 
inconsistent with the current beliefs then the resulting belief set is maximal. 

These “impossibility results” about iterated revision in the usual AGM frame- 
work can be seen as a justification for the increase in representational complexity 
that shows up when one goes from AGM to iterated belief revision frameworks 
(see e.g. [7,15,18,13,17,4]). Instead of “flat” belief sets (alias sets of interpreta- 
tions), the latter work with epistemic states, that can be represented by pre- 
orders on interpretations. 

The paper is organized as follows. In Sect. 2 we give some definitions and 
notations. In Sect. 3 we consider the Darwiche and Pearl postulates in the AGM 
framework. More specifically, we focus on their first postulate. In Sect. 4 we in- 
vestigate the implications of trying to retain old information as much as possible. 
In Sect. 5 we explore a family of postulates, saying that re-introducing old pieces 
of information is harmless. In Sect. 6 we compare our results to revision opera- 
tors with memory [13,14] and we investigate the implications of some postulates 
coming from this work. We conclude in Sect. 7. 

2 Preliminaries 

We work with a propositional language built from a set of atomic variables,
denoted by $p, q, \ldots$ Formulas are denoted by $A, B, C, \ldots$ We identify finite sets
of formulas (that we call belief sets) with the conjunction of their elements. A
belief set $B$ is informed about a formula $C$ if $B \vdash C$ or $B \vdash \neg C$. A belief
set $B$ is maximal (or complete) if $B$ is informed about every $C$.

The set of all interpretations is denoted $\mathcal{W}$, and the set of all belief sets
is denoted $\mathcal{B}$. For a formula $B$, $Mod(B)$ denotes the set of models of $B$, i.e.
$Mod(B) = \{\omega \in \mathcal{W} : \omega \models B\}$. For a set of interpretations
$M \subseteq \mathcal{W}$, $Form(M)$ denotes the formula (up to logical equivalence) whose set
of models is $M$, i.e. $Form(M) = \{B : \omega \models B \text{ iff } \omega \in M\}$.

A pre-order $\leq$ is a reflexive and transitive relation. $<$ is its strict counterpart:
$\omega < \omega'$ if and only if $\omega \leq \omega'$ and $\omega' \not\leq \omega$. And $\simeq$
is defined by $\omega \simeq \omega'$ iff $\omega \leq \omega'$ and $\omega' \leq \omega$. A
pre-order is total if for all $\omega, \omega'$ we have $\omega \leq \omega'$ or
$\omega' \leq \omega$. $\min(M, \leq)$ denotes the set
$\{\omega \in M \mid \nexists \omega' \in M : \omega' < \omega\}$.

Definition 1 (AGM belief revision). An AGM belief revision operator $\star$ is
a function that maps a belief set $B$ and a formula $A$ to a belief set $B \star A$ such
that:

(R1) $B \star A \vdash A$

(R2) If $B \wedge A \nvdash \bot$, then $B \star A \equiv B \wedge A$

(R3) If $A \nvdash \bot$, then $B \star A \nvdash \bot$

(R4) If $B_1 \equiv B_2$ and $A_1 \equiv A_2$, then $B_1 \star A_1 \equiv B_2 \star A_2$

(R5) $(B \star A) \wedge C \vdash B \star (A \wedge C)$

(R6) If $(B \star A) \wedge C \nvdash \bot$, then $B \star (A \wedge C) \vdash (B \star A) \wedge C$

The postulates (R1-R4) are often called the basic AGM postulates, and the
set (R1-R6) the extended AGM postulates, indicating that people consider the
former to be more fundamental. Notice however that they do not put very hard
constraints on $\star$. It is the last two postulates, (R5) and (R6), that allow us to state
the representation theorem below, which says that a revision operator corresponds
to a family of pre-orders on interpretations. (The theorem is due to Katsuno and
Mendelzon, but the idea can be directly traced back to Grove [10].) But first we
need the following:

Definition 2 (faithful assignment). A function that maps each belief set $B$
to a pre-order $\leq_B$ on interpretations is called a faithful assignment if and only
if the following holds:

1. If $\omega \models B$ and $\omega' \models B$, then $\omega \simeq_B \omega'$

2. If $\omega \models B$ and $\omega' \not\models B$, then $\omega <_B \omega'$

3. If $B_1 \equiv B_2$, then $\leq_{B_1} = \leq_{B_2}$

Theorem 1. A revision operator $\star$ satisfies postulates (R1-R6) if and only if
there exists a faithful assignment that maps each belief set $B$ to a total pre-order
$\leq_B$ such that:

\[
Mod(B \star A) = \min(Mod(A), \leq_B)
\]

We say that the assignment is the faithful assignment corresponding to the
revision operator.
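To make the representation theorem concrete, the following sketch computes $Mod(B \star A)$ as the minimal models of $A$ under a pre-order attached to $B$. It is only an illustration under stated assumptions: the language has two atoms, formulas are represented directly by their sets of models, belief sets are assumed consistent, and the pre-order is the Hamming-distance ranking of Dalal's operator (which the paper mentions later); all function names are hypothetical.

```python
from itertools import product

# Interpretations over two atoms (p, q), encoded as tuples of booleans.
WORLDS = list(product((False, True), repeat=2))


def models(pred):
    """Set of interpretations satisfying a Python predicate over (p, q)."""
    return frozenset(w for w in WORLDS if pred(*w))


def hamming(w, v):
    """Number of atoms on which two interpretations differ."""
    return sum(x != y for x, y in zip(w, v))


def dalal_rank(belief, w):
    """Rank of w in the pre-order assigned to `belief` (a non-empty set of models).

    Models of the belief set get rank 0, so conditions 1 and 2 of a
    faithful assignment hold by construction.
    """
    return min(hamming(w, v) for v in belief)


def revise(belief, new_info, rank=dalal_rank):
    """Mod(B * A) = min(Mod(A), <=_B), with <=_B induced by `rank`."""
    if not new_info:
        return frozenset()
    best = min(rank(belief, w) for w in new_info)
    return frozenset(w for w in new_info if rank(belief, w) == best)


B = models(lambda p, q: p and q)
A = models(lambda p, q: not p)
print(sorted(revise(B, A)))
```

Any other ranking function that satisfies the faithful-assignment conditions can be passed as `rank`; the Dalal ranking is used here only because it is easy to state.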

Let us now introduce a special family of revision operators, called maxichoice 
revision operators [1,9]. 

Definition 3 (maxichoice revision). A belief revision operator $\star$ is a maxichoice
revision operator if for every $B$ and $A$, if $B \vdash \neg A$ then $B \star A$ is maximal.

Maxichoice revision operators are not very satisfactory, since they are too
precise and behave too drastically. In fact, with those operators, learning
any piece of information that conflicts with the current beliefs, however incomplete
they are, causes the agent to have beliefs on any formula: for any formula
$A$, either the agent believes that $A$ holds or he believes that $\neg A$ holds. They
are considered as an upper bound for revision operators (the lower bound being
full-meet revision operators [1,9]).

We will use a characterization of maxichoice operators on the semantical 
level. First we define: 

Definition 4. A linear faithful assignment is a faithful assignment that satisfies

4. If $\omega \not\models B$ and $\omega' \not\models B$, then $\omega <_B \omega'$ or $\omega' <_B \omega$

The following result is part of the folklore in the literature on revision:

Theorem 2. A revision operator * is a maxichoice operator if and only if its 
corresponding assignment is a linear faithful assignment. 



The proof is straightforward.

3 Darwiche and Pearl Postulates in the AGM Framework

In [6], Darwiche and Pearl first stated their well-known postulates (C1-C4) in
the classical AGM framework.

(C1) If $A \vdash C$, then $(B \star C) \star A \equiv B \star A$
(C2) If $A \vdash \neg C$, then $(B \star C) \star A \equiv B \star A$
(C3) If $B \star A \vdash C$, then $(B \star C) \star A \vdash C$
(C4) If $B \star A \nvdash \neg C$, then $(B \star C) \star A \nvdash \neg C$

But it has been shown in [8,15] that (C2) is inconsistent with AGM, and
that under the AGM postulates (C1) implies (C3) and (C4). To remove these
contradictions, Darwiche and Pearl rephrased their postulates and the AGM postulates in
terms of epistemic states [7].

As (C1) is consistent with the AGM postulates, one might wonder what the
constraints imposed by this postulate on the revision operators are like. This
question has not been investigated as far as we know. The consistency of (C1)
with AGM is easily established by noticing that the full meet revision operator
satisfies (C1) [15]. But is this the only AGM operator satisfying (C1), or do we
face a wider family?

Let us define another particular family of revision operators. 

Definition 5. Let $\leq$ be a total pre-order on interpretations. A revision operator
$\star$ is said to be imposed by $\leq$ if its corresponding faithful assignment satisfies the
following property:

i. If $\omega \not\models B$ and $\omega' \not\models B$, then ($\omega \leq_B \omega'$ iff $\omega \leq \omega'$).

As far as we know, this family of operators has not been studied yet. Such
operators are not satisfactory since the result of a revision does not depend on
the belief set, but merely on the new piece of information (see Theorem 3). This
seems to be counter-intuitive and to go against the basic ideas behind revision.
Nevertheless, such operators fulfill all AGM postulates, and the full meet revision
operator is a particular case (when $\leq$ is a flat pre-order, i.e. $\omega \simeq \omega'$,
$\forall \omega, \omega' \in \mathcal{W}$).

Theorem 3. Let $\star$ be an AGM revision operator, and let $f$ be any function
mapping formulas to formulas such that $f(A) \vdash A$ and if $A_1 \equiv A_2$ then
$f(A_1) \equiv f(A_2)$. $\star$ is imposed if and only if for any belief set $B$ and formula $A$,
the following holds:

(IMP) If $B \vdash \neg A$ then $B \star A \equiv f(A)$.

Proof. The only if part is straightforward: define $f(A)$ as $\min(Mod(A), \leq)$.

For the if part we need to build the imposed pre-order $\leq$ from $f(A)$.
This can be established by noting that if we take a formula $A$ that has exactly
two (distinct) models $\omega$ and $\omega'$, then by (IMP) for every $B$ such that
$A \wedge B \vdash \bot$, we have $B \star A \equiv f(A)$. By (R1) and (R3),
$Mod(f(A)) = \{\omega\}$ or $Mod(f(A)) = \{\omega'\}$ or $Mod(f(A)) = \{\omega, \omega'\}$.
Since $\star$ is an AGM operator, the faithful assignment gives us, for every $B$
inconsistent with $A$, that $\omega <_B \omega'$ whenever $Mod(f(A)) = \{\omega\}$,
$\omega' <_B \omega$ whenever $Mod(f(A)) = \{\omega'\}$, and $\omega \simeq_B \omega'$
whenever $Mod(f(A)) = \{\omega, \omega'\}$. That means that there exists a pre-order $\leq$
defined as $\omega \leq \omega'$ iff $\omega \in Mod(f(Form(\omega, \omega')))$ and such that for
all $B$ such that $\omega, \omega' \not\models B$, $\omega \leq_B \omega'$ iff $\omega \leq \omega'$.

This result states that for any revision that is not an expansion the old belief
set is not taken into account in the result of the revision.
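The independence from the old belief set can be seen on a small instance. The sketch below assumes two atoms, formulas given by their sets of models, a consistent input formula, and an arbitrarily chosen global ranking standing in for the imposed pre-order $\leq$; the ranking values and function names are hypothetical.

```python
from itertools import product

# Interpretations over two atoms (p, q).
WORLDS = list(product((False, True), repeat=2))


def models(pred):
    """Set of interpretations satisfying a Python predicate over (p, q)."""
    return frozenset(w for w in WORLDS if pred(*w))


# Hypothetical fixed total pre-order on interpretations (lower rank = preferred).
GLOBAL_RANK = {
    (False, False): 0,
    (False, True): 1,
    (True, False): 1,
    (True, True): 2,
}


def imposed_revise(belief, new_info):
    """Revision imposed by GLOBAL_RANK; `new_info` is assumed non-empty."""
    common = belief & new_info
    if common:
        # Expansion case, as required by (R2).
        return common
    # Otherwise the result is f(A): it depends on the input formula only.
    best = min(GLOBAL_RANK[w] for w in new_info)
    return frozenset(w for w in new_info if GLOBAL_RANK[w] == best)


A = models(lambda p, q: p)
B1 = models(lambda p, q: not p and not q)
B2 = models(lambda p, q: not p and q)
print(imposed_revise(B1, A) == imposed_revise(B2, A))
```

Both belief sets contradict $A$, and both revisions return the same result, namely the minimal models of $A$ under the global ranking.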

Now let us return to the case of the (C1) postulate and state the following
result:

Theorem 4. An AGM revision operator satisfies (C1) if and only if it is imposed.

Proof. The if part is straightforward, since either $B \wedge A$ is consistent and then
(C1) is a consequence of (R2), or $B \wedge A$ is not consistent, and then (C1) is a
consequence of Theorem 3.

For the only if part, suppose that the operator $\star$ satisfies (R1-R6) and (C1).
We will show that the operator is imposed and there exists an $f$ such that (IMP)
is satisfied. If $\star$ satisfies (C1) then (IMP) holds, since for every $A$ and $B$ such
that $A \wedge B$ is not consistent, by (R2) we have that
$B \star A \equiv (\neg A \star (A \vee B)) \star A$. Thus by (C1) we get that
$(\neg A \star (A \vee B)) \star A \equiv \neg A \star A$, consequently we get
$B \star A \equiv \neg A \star A$. Thus $f$ can be defined by stipulating that
$f(A) = \neg A \star A$. This means that the result of the revision depends only on the
input $A$.

This result casts serious doubts on the (C1) postulate in the AGM framework.

4 “Keep On Being Informed about A” 

When an agent receives new information she has to modify her current set of 
beliefs B in order to take it into account. One major requirement of AGM theory 
is the principle of minimal change, that means that when one revises a belief set 
by a new piece of information, one has to keep “as much as possible” of the old 
belief set. 

The following property tries to capture this intuition, by saying that revising
by $A$ cannot induce a loss of information: if $B$ is informed about $C$, then learning
$A$ cannot lead to losing this information.

(Compl) If $B \vdash C$ then $B \star A \vdash C$ or $B \star A \vdash \neg C$

Unfortunately it can be proved that:

Theorem 5. If $\star$ satisfies (R1-R6) and (Compl), then $\star$ is a maxichoice revision
operator.

Proof. This can be proved straightforwardly: suppose $B \vdash A$. If $\vdash A$ then the
theorem holds. Else we have $B \vdash A \vee C$ and $B \vdash A \vee \neg C$. By (Compl),
$B \star \neg A \vdash A \vee C$ or $B \star \neg A \vdash \neg A \wedge \neg C$, and
$B \star \neg A \vdash A \vee \neg C$ or $B \star \neg A \vdash \neg A \wedge C$. Among
the four cases, the one where $B \star \neg A \vdash (A \vee C) \wedge (A \vee \neg C)$ is
impossible because $B \star \neg A \vdash \neg A$ by (R1) and $\nvdash A$, so that
$B \star \neg A$ is consistent by (R3). The one where
$B \star \neg A \vdash (\neg A \wedge \neg C) \wedge (\neg A \wedge C)$
is impossible because $B \star \neg A \nvdash \bot$. It follows that
$B \star \neg A \vdash \neg C$ or $B \star \neg A \vdash C$.

It is straightforward to show that every maxichoice revision operator sat- 
isfies (Compl). Together with the preceding theorem it follows that (Compl) 
characterizes maxichoice revision. 

Remark 1. Formula (3.17) in [9] is just (Compl) (modulo a typo). There,
proposition (3.19) says that “$B \star A$ is maximal for any sentence $A$ such that
$\neg A \in B$”, i.e. (3.17) entails maxichoice revision. The proof refers to observation
3.2 of [2], but the latter presupposes already that $\star$ is a maxichoice operator,
and establishes that this entails maximality.

So this postulate puts too strong a requirement on classical AGM revision
operators.

In the next section we will investigate another requirement also based on the 
assumption that we can keep as much as possible of the old information. 

5 “Re-introducing Old Information Doesn’t Harm” 

Another way of ensuring that one does not forget previous information is to 
suppose that we can re-introduce the old belief set without changing the current 
one. It can be seen as some kind of left-idempotency of the revision operator. This 
idea is very close to the one used for defining revision with memory operators 
[14,13,3]. 

First we need the following abbreviations. 

Definition 6. Given a set of beliefs $B$ and pieces of information $A_i$, then for
$1 \leq i \leq n$ we define $B_i$ by:

\[
B_i = (\ldots((B \star A_1) \star A_2) \star \ldots) \star A_i
\]

Thus $B_0 = B$, $B_1 = B \star A_1$, and $B_2 = (B \star A_1) \star A_2$.

Our abbreviation enables us to concisely formulate the following family of
postulates:

(Mem$_i$) $B_i \equiv B \star B_i$, for $i \geq 0$

Hence:

(Mem$_0$) says $B_0 \equiv B \star B_0$, i.e. $B \equiv B \star B$,

(Mem$_1$) says $B_1 \equiv B \star B_1$, i.e. $B \star A_1 \equiv B \star (B \star A_1)$, and

(Mem$_2$) says $B_2 \equiv B \star B_2$, i.e. $(B \star A_1) \star A_2 \equiv B \star ((B \star A_1) \star A_2)$.

Let us now see how the postulates (Mem$_i$) relate to the AGM postulates.

Theorem 6. (Mem$_0$) is derivable from the basic AGM postulates.

The proof only uses the postulate (R2).

Theorem 7. (Mem$_1$) is derivable from the extended AGM postulates.

Proof. From (R1) we know that $(B \star A) \wedge A \equiv B \star A$. Now using (R5) and (R6)
with $C = B \star A$, we have $B \star (A \wedge (B \star A)) \equiv (B \star A) \wedge (B \star A)$.
That is directly $B \star (B \star A) \equiv B \star A$.



Theorem 8. (Mem$_2$), (Mem$_3$), etc. cannot be derived from the AGM postulates.

Proof. This can be established e.g. by considering Dalal’s revision operator [5],
which is known to satisfy the AGM postulates [12], and showing that it does not
satisfy the (Mem$_i$) postulates. Indeed, consider $B = \neg p$, $A_1 = \neg q$, $A_2 = p \vee q$.
Then $B_2 = (\neg p \star \neg q) \star (p \vee q) = (\neg p \wedge \neg q) \star (p \vee q) = p \oplus q$,
where $\oplus$ is the exclusive or. But this is different from
$B \star B_2 = \neg p \star ((\neg p \star \neg q) \star (p \vee q))
= \neg p \star ((\neg p \wedge \neg q) \star (p \vee q)) = \neg p \star (p \oplus q) = \neg p \wedge q$.
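The counterexample can be replayed by brute force over the four interpretations of $p$ and $q$. The sketch below assumes the usual reading of Dalal's operator as selecting the models of the input at minimal Hamming distance from the models of the belief set; formulas are represented by their sets of models, and the helper names are illustrative.

```python
from itertools import product

# Interpretations over the atoms (p, q).
WORLDS = list(product((False, True), repeat=2))


def models(pred):
    """Set of interpretations satisfying a Python predicate over (p, q)."""
    return frozenset(w for w in WORLDS if pred(*w))


def hamming(w, v):
    """Number of atoms on which two interpretations differ."""
    return sum(x != y for x, y in zip(w, v))


def dalal(belief, new_info):
    """Models of `new_info` closest (in Hamming distance) to some model of `belief`."""
    dist = {w: min(hamming(w, v) for v in belief) for w in new_info}
    best = min(dist.values())
    return frozenset(w for w in new_info if dist[w] == best)


B = models(lambda p, q: not p)
A1 = models(lambda p, q: not q)
A2 = models(lambda p, q: p or q)

B2 = dalal(dalal(B, A1), A2)   # (B * A1) * A2
B_star_B2 = dalal(B, B2)       # B * ((B * A1) * A2)

print(sorted(B2), sorted(B_star_B2))
```

The two results differ: the first is the set of models of $p \oplus q$, the second the single model of $\neg p \wedge q$, in line with the proof.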

We can easily find revision operators satisfying these additional postulates : 

Theorem 9. If $\star$ is a maxichoice revision operator then $\star$ satisfies every
postulate (Mem$_i$).

The postulates of this family are ordered by strength, as the following
result shows:

Theorem 10. If $\star$ satisfies postulate (Mem$_{i+1}$) then $\star$ satisfies postulate
(Mem$_i$).



The other way round, (Mem$_i$) does not always imply (Mem$_{i+1}$): this is
immediate for $i = 0$.

So are the families of operators defined by the (Mem$_i$) postulates wide
ones? This is not the case. We show that, once again, only maxichoice revision
operators satisfy our postulates.

Theorem 11. If $\star$ satisfies (R1-R6) and (Mem$_2$), then $\star$ is a maxichoice
revision operator.

Proof. Suppose that $A$ is consistent and that $B \vdash \neg A$. We want to show that
$B \star A$ is maximal, i.e. for an arbitrary $C$ we have that either $B \star A \vdash C$, or
$B \star A \vdash \neg C$.

First, (Mem$_2$) tells us that
$(\neg A \vee C) \star B \star A \equiv (\neg A \vee C) \star ((\neg A \vee C) \star B \star A)$,
and similarly
$(\neg A \vee \neg C) \star B \star A \equiv (\neg A \vee \neg C) \star ((\neg A \vee \neg C) \star B \star A)$.
As $B \vdash \neg A$ we have $B \equiv (\neg A \vee C) \star B$ by (R2), and similarly
$B \equiv (\neg A \vee \neg C) \star B$. Hence
$(\neg A \vee C) \star B \star A \equiv (\neg A \vee C) \star ((\neg A \vee C) \star B \star A)
\equiv (\neg A \vee C) \star (B \star A)$, and similarly
$(\neg A \vee \neg C) \star B \star A \equiv (\neg A \vee \neg C) \star ((\neg A \vee \neg C) \star B \star A)
\equiv (\neg A \vee \neg C) \star (B \star A)$.
Now suppose that not (either $B \star A \vdash C$, or $B \star A \vdash \neg C$), i.e. $B \star A$
is consistent with $C$, and $B \star A$ is consistent with $\neg C$. Then we must have
$(\neg A \vee C) \star B \star A \equiv (\neg A \vee C) \star (B \star A) \equiv (\neg A \vee C) \wedge (B \star A)$,
and
$(\neg A \vee \neg C) \star B \star A \equiv (\neg A \vee \neg C) \star (B \star A) \equiv (\neg A \vee \neg C) \wedge (B \star A)$.
As $B \star A \vdash A$, we would have that $(\neg A \vee C) \wedge (B \star A) \vdash C$,
and $(\neg A \vee \neg C) \wedge (B \star A) \vdash \neg C$. Since both left-hand sides are
equivalent to $B \star A$, this would give $B \star A \vdash C$ and $B \star A \vdash \neg C$.
But by AGM $(\neg A \vee C) \star (B \star A)$ must be consistent.

A corollary of Theorems 10 and 11 is that an AGM revision operator satisfies
a (Mem$_i$) postulate with $i \geq 2$ if and only if it is a maxichoice revision operator. So
each such postulate of this family is a characterisation of maxichoice operators.

As explained at the beginning of this section, the idea of this family of postu- 
lates seems very close to the one behind the definition of revision with memory 
operators. In the next section we will investigate more deeply the links between 
revision with memory operators and the requirements on classical AGM revision 
operators. 



6 The Relation with Revision with Memory Operators 

Belief revision operators with memory [14,13] keep trace of the history of beliefs 
in order to be able to use them whenever further revisions make this possible. 
They are based on a notion of belief state that is more complex than the flat set 
of beliefs of the AGM framework. 

Basically, if we represent epistemic states $\Phi$ by a pre-order on interpretations,
noted $\leq_\Phi$, we can extract the associated belief set with the projection operator
$Bel(\Phi) = \min(\mathcal{W}, \leq_\Phi)$. The pre-order $\leq_\Phi$ represents the agent’s
relative confidence in interpretations. For example $\omega <_\Phi \omega'$ means that for the
agent in the epistemic state $\Phi$ the interpretation $\omega$ seems (strictly) more plausible
than the interpretation $\omega'$.

The usual logical notations extend straightforwardly to epistemic states (they
in fact denote conditions on the associated belief sets). For example $\Phi \vdash C$,
$\Phi \wedge C$ and $\omega \models \Phi$ respectively mean $Bel(\Phi) \vdash C$,
$Bel(\Phi) \wedge C$ and $\omega \models Bel(\Phi)$.

Now let us define revision with memory operators. This family of operators
is parametrized by a classical AGM operator. It can be seen as a tool to change
a classical AGM operator with bad iteration properties into an operator that
has good ones.

Definition 7 (revision with memory). Suppose that we have at our disposal a classical
AGM operator $\star$. (We will use its corresponding faithful assignment $C \mapsto \leq_C$.)
Then we define the epistemic state (the pre-order) $\Phi \circ C$ that results from the
revision with memory of $\Phi$ by the new information $C$ as:

\[
\omega \leq_{\Phi \circ C} \omega' \quad \text{iff} \quad \omega <_C \omega' \ \text{ or } \
(\omega \simeq_C \omega' \text{ and } \omega \leq_\Phi \omega')
\]

This definition means that each incoming piece of information induces some
credibility ordering. (The exact ordering induced depends on the classical AGM
operator that has been chosen.¹) And the new epistemic state is built by listening
first to this incoming piece of information, and then to the old epistemic state
(this is the well known primacy of update principle).
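The lexicographic flavour of this definition can be sketched in a few lines. The example below is an illustration under assumptions: two atoms, formulas given by their sets of models, an epistemic state encoded as a dictionary from interpretations to comparable keys (smaller means more plausible), and the two-level pre-order (models below counter-models) as the ordering induced by each input. All names are hypothetical.

```python
from itertools import product

# Interpretations over two atoms (p, q).
WORLDS = list(product((False, True), repeat=2))


def models(pred):
    """Set of interpretations satisfying a Python predicate over (p, q)."""
    return frozenset(w for w in WORLDS if pred(*w))


def two_level_rank(formula):
    """Pre-order induced by the input: models at level 0, counter-models at level 1."""
    return {w: 0 if w in formula else 1 for w in WORLDS}


def revise_with_memory(state, formula):
    """Listen first to the new information, then to the old epistemic state.

    Tuple comparison is lexicographic, so w is at least as plausible as w'
    iff it is strictly better for the input, or equally good for the input
    and at least as plausible in the old state.
    """
    new = two_level_rank(formula)
    return {w: (new[w], state[w]) for w in WORLDS}


def bel(state):
    """Belief set of an epistemic state: its most plausible interpretations."""
    best = min(state.values())
    return frozenset(w for w in WORLDS if state[w] == best)


EMPTY = {w: () for w in WORLDS}  # flat pre-order: all interpretations equivalent

s1 = revise_with_memory(EMPTY, models(lambda p, q: p))
s2 = revise_with_memory(s1, models(lambda p, q: not p or not q))
print(sorted(bel(s1)), sorted(bel(s2)))
```

After learning $p$ and then $\neg p \vee \neg q$, the belief set is $p \wedge \neg q$: the older information $p$ is still used to break ties among the models of the newer input.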

In fact, it is shown in [13] that an epistemic state for revision with memory
operators can be encoded as the history of the new pieces of information acquired
by the agent since its “birth”. So we can suppose that the agent starts from
an “empty” epistemic state $S$, that is represented by a flat pre-order², and
successively accommodates all the pieces of information. So if we suppose that
all revision sequences start from $S$, it can be shown that all revision with memory
operators satisfy the (Mem$_i$) postulates, since they all take the history of the
revisions into account.

Theorem 12. A revision operator with memory satisfies (Mem$_i$), $\forall i$.

In fact, a logical characterization for revision with memory operators has 
been given in [13]. Most of the postulates are generalizations of AGM postulates 
in the epistemic states framework, but there are also some specific postulates 
characterizing revision with memory. We will examine now their status in the 
classical belief set framework. Those postulates have been written for epistemic 
states, but we can translate them for belief sets (with some simplifications) as 
follows : 

(Hist1) $(B \star A) \star C \equiv B \star (A \star C)$

(Hist2) If $C \star A \equiv A$, then $(B \star C) \star A \equiv B \star A$

(Hist3) If $C \star A \vdash D$, then $(B \star C) \star A \vdash D$

The first postulate expresses some kind of associativity and aims at expressing
the strong influence of the new piece of information. The second one says that
if a formula $C$ does not distinguish between the models of $A$, then learning $C$
before $A$ is without effect on the resulting belief set. The third one says that the
consequences of a revision also hold if we first learn another piece of information.

The counterpart of (Histl), (Hist2) and (Hist3) for epistemic states are re- 
spectively named (H7), (H’7) and (H’8) in [13]. It is shown there that in the 
presence of the other postulates (H1-H6) (that are mainly a generalisation of 
AGM postulates in the epistemic state framework), (H7) is equivalent to (H’7- 
H’8). 

This equivalence no longer holds in the belief set framework. Let us see now 
the implications of these three postulates in this framework. 

Theorem 13. There is no operator that satisfies (R1-R6) and (Hist1).

Proof. Let $\omega_0, \omega_1, \omega_2, \omega_3$ be 4 distinct interpretations. Now take four
formulas $A, B, C, D$ such that $Mod(A) = \{\omega_1, \omega_2\}$, $Mod(B) = \{\omega_0, \omega_1\}$,
$Mod(C) = \{\omega_2, \omega_3\}$ and $Mod(D) = \{\omega_1, \omega_3\}$. From (Hist1) we have that
$(B \star A) \star C \equiv B \star (A \star C)$, that is from (R2)
$(B \wedge A) \star C \equiv B \star (A \wedge C)$. As $Mod(A \wedge C) = \{\omega_2\}$,
from (R1) and (R3) it follows that $Mod(B \star (A \wedge C)) = \{\omega_2\}$, hence
$Mod((B \wedge A) \star C) = \{\omega_2\}$. On the other side, starting from (Hist1) with
$(B \star D) \star C \equiv B \star (D \star C)$, we obtain similarly
$Mod(B \star (D \wedge C)) = Mod((B \wedge D) \star C) = \{\omega_3\}$. Now notice
that $B \wedge D \equiv B \wedge A$, so (R4) says that
$(B \wedge A) \star C \equiv (B \wedge D) \star C$. Contradiction.

¹ Note that one of the possibilities is a two level pre-order with the models of the
formula at the lowest level, and the counter-models at the top level. That gives the
more “classical” operator of the family [18,16,20,3].

² That is, $\omega \simeq_S \omega'$.

Note that (Hist2) is stronger than the postulate (C1) proposed by Darwiche
and Pearl. As (C1) is consistent with the AGM postulates we will consider a
weakening of the (Hist2) postulate, that accounts for the case when $A \nvdash C$:

(StrictHist2) If $C \star A \equiv A$ and $A \nvdash C$, then $(B \star C) \star A \equiv B \star A$



Theorem 14. If an operator $\star$ satisfies (R1-R6) and (StrictHist2), then $\star$ is a
maxichoice revision operator.

Proof. We show that if $\star$ satisfies (StrictHist2), then $\star$ is maxichoice. If $\star$ is
not maxichoice, then there exists a formula $C$ such that $\leq_C$ is not linear, that
means that we can find a formula $A$ and two distinct interpretations $\omega, \omega'$, with
$Mod(A) = \{\omega, \omega'\}$ (with $\omega \neq \omega'$) such that $C \wedge A$ is not
consistent³ and $\omega \simeq_C \omega'$, i.e. $C \star A \equiv A$. (StrictHist2) then says that
for all $B$, $(B \star C) \star A \equiv B \star A$. In particular if we take $B$ such that
$Mod(B) = Mod(C) \cup \{\omega\}$, then $B \wedge C \equiv C$, so $B \star C \equiv C$ by (R2),
and that means that $C \star A \equiv B \star A \equiv A$. But from (R2) we get that
$B \star A \equiv B \wedge A$, so $Mod(B \star A) = \{\omega\}$. Contradiction.

So, as a corollary of Theorems 14 and 4, every operator satisfying (R1-R6) and
(Hist2) must be an imposed maxichoice operator.

Theorem 15. There is no operator that satisfies (R1-R6) and (Hist3).

Proof. Let $\omega_0, \omega_1, \omega_2$ be 3 distinct interpretations. Now take four formulas
$A, B, C, D$ such that $Mod(A) = \{\omega_1, \omega_2\}$, $Mod(B) = \{\omega_0\}$,
$Mod(C) = \{\omega_0, \omega_1\}$, and $Mod(D) = \{\omega_0, \omega_2\}$. As from (R2)
$C \star A \equiv C \wedge A$, then $Mod(C \star A) = \{\omega_1\}$, so from (Hist3) and (R3),
that means that $Mod((B \star C) \star A) = \{\omega_1\}$. On the other side, starting from
$D \star A$, we find similarly that $Mod((B \star D) \star A) = \{\omega_2\}$.
Finally, as from (R2) we find easily that $B \star C \equiv B \star D$, from (R4) we have
that $(B \star C) \star A \equiv (B \star D) \star A$. Contradiction.

These three results show, once again, that it is hard to try to formulate
iteration postulates in the AGM framework. Whereas those properties are meaningful
in the epistemic state framework, two of them, (Hist1) and (Hist3), are
not consistent with AGM postulates for belief set revision, and the last one,
(StrictHist2), implies the maxichoice property.

³ When $\star$ is an AGM revision operator and $C \star A \equiv A$, then
$C \wedge A \nvdash \bot$ is equivalent to $A \vdash C$.

7 Conclusion

Studies in iterated belief revision have been stated in the epistemic state framework
mainly because of the influence of Darwiche and Pearl’s proposal [6,7] and
its incompatibility with the AGM belief set framework. But since then, little work has
been done to see if some properties on iteration can be stated in the classical
framework.

We have addressed this issue in this paper by looking at some candidate
postulates. In different ways, all of them express that the result of a revision
must keep as much as possible of the old information.

Our results are mainly negative. When the proposed postulates are not inconsistent
with classical AGM ones, they inexorably lead to the maxichoice property,
which is far from satisfactory for a sensible revision operator. So the results
obtained in this paper can be seen as “impossibility results” about iteration in the
classical AGM framework.

This study is then important to justify the gap, both in terms of knowledge 
representation and in terms of computational complexity, induced by all the 
iterated revision approaches that abandon the classical framework and work 
with more complex objects, viz. epistemic states.

Abstract. In this paper, we propose some extensions of epistemic logic
for reasoning about information fusion. The fusion operators considered 
in this paper include majority merging, arbitration, and general 
merging. Some modalities corresponding to these fusion operators are 
added to epistemic logics and the Kripke semantics of these extended 
logics are presented. While most existing approaches treat information 
fusion operators as meta-level constructs, these operators are directly 
incorporated into our object logic language. Thus it is possible to reason 
about not only the merged results but also the fusion process in our 
logics. 

Keywords: epistemic logic, database merging, belief fusion, majority 
merging, arbitration, general merging, belief revision, multi-agent 

1 Introduction 

The philosophical analysis of knowledge and belief has stimulated the devel- 
opment of the so-called epistemic logic [21]. This kind of logic has attracted 
the attention of researchers from diverse fields such as artificial intelligence (AI), 
economics, linguistics, and theoretical computer science. Among them, the AI 
researchers and computer scientists develop some technically sophisticated for- 
malisms and apply them to the analysis of distributed and multi-agent systems 
[20,36]. 

The application of epistemic logic to AI and computer science puts its 
emphasis on the interaction of agents, so multi-agent epistemic logic is urgently 
needed. One representative example of such logic is proposed by Fagin et al. 
[20]. The term “knowledge” is used in a broad sense in [20] to cover cases of 
belief and information.¹ The most novel feature of their logic is the 
consideration of common knowledge and distributed knowledge among a group of 
agents. Distributed knowledge is that which can be deduced by pooling together 
everyone’s knowledge. While it is required that proper knowledge must be true, 
the belief of an agent may be wrong. Therefore, in general, there will be conflict 
in the beliefs to be merged. In this case, everything can be deduced from the 
distributed beliefs due to the notorious omniscience property of epistemic logic, 
so the merged result will be useless for further reasoning. 

¹ More precisely, the logic for belief is called doxastic logic. However, here we will 
use the three terms knowledge, belief, and information interchangeably, so epistemic 
logic is assumed to cover all these notions. 

Instead of directly putting all beliefs of the agents together, there are other 
sophisticated techniques for knowledge base merging [12,15,25,26,27,33,34,35]. 
Most of the approaches treat belief fusion operators as meta-level constructs, so 
given a set of knowledge bases, these fusion operators will return the merged 
results. More precisely, a fusion operator is used to combine a set of 
knowledge bases $T_1, T_2, \ldots, T_k$, where each knowledge base is a theory in 
some logical language. 

Some of the above-mentioned works present concrete operators that can be 
used directly in the fusion process, while others stipulate the desirable prop- 
erties of reasonable belief fusion operators by postulates. However, few of the 
approaches provide the capability of reasoning about the fusion process. In this 
paper, we propose that belief fusion operators can be incorporated into the ob- 
ject language of the multi-agent epistemic logic, so we can reason not only with 
the merged results but also about the fusion process. 



1.1 Preliminary 

Let $\mathcal{L}$ denote the language of epistemic logic. The alphabet of 
$\mathcal{L}$ contains the following symbols: a countable set 
$\Phi_0 = \{p_1, p_2, \ldots\}$ of atomic propositions; the propositional 
constants $\bot$ (falsum or falsity constant) and $\top$ (verum or truth 
constant); the binary Boolean operator $\vee$ (or), and unary Boolean operator 
$\neg$ (not); a set $Ag = \{1, 2, \ldots, n\}$ of agents; the modal 
operator-forming symbols “[” and “]”; and the left and right parentheses “(” 
and “)”. 

The set of well-formed formulas (wffs) is defined as the smallest set containing 
$\Phi_0 \cup \{\bot, \top\}$ and closed under Boolean operators and the 
following rule: 

if $\varphi$ is a wff, then $[G]\varphi$ is a wff for any nonempty $G \subseteq Ag$. 

The intuitive meaning of $[G]\varphi$ is “The group of agents $G$ has distributed 
belief $\varphi$”. 

As usual, other classical Boolean connectives $\wedge$ (and), $\supset$ 
(implication), and $\equiv$ (equivalence) can be defined as abbreviations. Also, 
we will write $\langle G \rangle\varphi$ as an abbreviation of 
$\neg[G]\neg\varphi$. When $G$ is a singleton $\{i\}$, we will write $[i]\varphi$ 
instead of $[\{i\}]\varphi$, so $[i]\varphi$ means that agent $i$ knows $\varphi$. 

For the semantics, a possible world model for $\mathcal{L}$ is a structure 

$(W, (\mathcal{R}_i)_{1 \le i \le n}, V)$ 

where 

- $W$ is a set of possible worlds, 
- $\mathcal{R}_i \subseteq W \times W$ is a serial binary relation² over $W$ for 
  $1 \le i \le n$, 
- $V : \Phi_0 \to 2^W$ is a truth assignment mapping each atomic proposition to the 
  set of worlds in which it is true. 

² A binary relation $\mathcal{R}$ is serial if $\forall w \exists u.\, \mathcal{R}(w, u)$. 

From the binary relations $\mathcal{R}_i$, we can define a derived relation 
$\mathcal{R}_G$ for each nonempty $G \subseteq Ag$: 

$\mathcal{R}_G = \bigcap_{i \in G} \mathcal{R}_i$. 

Informally, $\mathcal{R}_i(w)$ is the set of worlds that agent $i$ considers 
possible under $w$ according to his belief, so $\mathcal{R}_G(w)$ is the set of 
worlds that are considered possible under $w$ according to the direct fusion of 
agents’ beliefs. The informal intuition is reflected in the definition of the 
satisfaction relation. Let $M = (W, (\mathcal{R}_i)_{1 \le i \le n}, V)$ be a 
model and $\Phi$ be the set of wffs for $\mathcal{L}$, then the satisfaction 
relation $\models_M \subseteq W \times \Phi$ is defined by the following 
inductive rules (we will use the infix notation for the relation and omit the 
subscript $M$ for convenience): 

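1. $w \models p$ iff $w \in V(p)$, for each $p \in \Phi_0$, 

2. $w \not\models \bot$ and $w \models \top$, 

3. $w \models \neg\varphi$ iff $w \not\models \varphi$, 

4. $w \models \varphi \vee \psi$ iff $w \models \varphi$ or $w \models \psi$, 

5. $w \models [G]\varphi$ iff for all $u \in \mathcal{R}_G(w)$, $u \models \varphi$. 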
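
The five satisfaction rules can be checked mechanically on a finite model. The following is a minimal sketch, assuming that the derived relation $\mathcal{R}_G$ is the intersection of the individual relations (the usual reading of distributed belief); the tuple encoding of formulas and the names `successors`, `holds`, and `MODEL` are illustrative choices, not part of the paper.

```python
# Formulas are encoded as nested tuples (a hypothetical encoding):
#   "p"                  atomic proposition
#   ("bot",), ("top",)   falsum and verum
#   ("not", f)           negation
#   ("or", f, g)         disjunction
#   ("box", G, f)        [G]f, where G is a frozenset of agents


def successors(R, G, w):
    """R_G(w), assuming R_G is the intersection of R_i for i in G."""
    per_agent = [{v for (u, v) in R[i] if u == w} for i in G]
    return set.intersection(*per_agent)


def holds(model, w, f):
    """Return True iff formula f is satisfied at world w of the model."""
    W, R, V = model
    if isinstance(f, str):                      # rule 1
        return w in V[f]
    op = f[0]
    if op == "bot":                             # rule 2
        return False
    if op == "top":
        return True
    if op == "not":                             # rule 3
        return not holds(model, w, f[1])
    if op == "or":                              # rule 4
        return holds(model, w, f[1]) or holds(model, w, f[2])
    if op == "box":                             # rule 5
        return all(holds(model, u, f[2]) for u in successors(R, f[1], w))
    raise ValueError(f"unknown operator: {op!r}")


# A three-world example: at world 0, agent 1 considers {0, 1} possible and
# agent 2 considers {1, 2} possible; p is true only at world 1.
MODEL = (
    {0, 1, 2},
    {1: {(0, 0), (0, 1), (1, 1), (2, 2)},
     2: {(0, 1), (0, 2), (1, 1), (2, 2)}},
    {"p": {1}},
)
```

In this example neither agent believes p on its own at world 0, but the group {1, 2} has distributed belief p, because world 1 is the only world both agents consider possible.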

In the presentation below, we will extensively use the notion of pre-order. 
Let $S$ be a set, then a pre-order over $S$ is a reflexive and transitive binary 
relation $\le$ on $S$. A pre-order over $S$ is called total (or connected) if for 
all $x, y \in S$, either $x \le y$ or $y \le x$ holds. We will write $x < y$ as the 
abbreviation of $x \le y$ and $y \not\le x$. For a subset $S'$ of $S$, 
$\min(S', \le)$ is defined as the set $\{x \in S' \mid \forall y \in S',\ y \not< x\}$. 
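
As a small illustration of the definition of $\min(S', \le)$, the sketch below computes the minimal elements of a finite set under an arbitrary pre-order supplied as a Python predicate. The function name `minimal` and the two example orderings are assumptions made for illustration only.

```python
def minimal(S, leq):
    """min(S, <=): the elements x of S such that no y in S is strictly below x."""
    def strictly_below(y, x):
        # y < x abbreviates: y <= x and not x <= y
        return leq(y, x) and not leq(x, y)

    return {x for x in S if not any(strictly_below(y, x) for y in S)}


# A total pre-order: compare strings by length (ties are allowed).
by_length = minimal({"a", "bb", "c"}, lambda x, y: len(x) <= len(y))

# A partial order: x <= y iff x divides y.
by_divisibility = minimal({2, 3, 4, 6}, lambda x, y: y % x == 0)
```

With a total pre-order all minimal elements are equally ranked; with a partial one, as in the divisibility example, incomparable elements can both be minimal.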



2 Merging by Majority 

Majority voting is a method to resolve conflict between agents. For example, if 
three knowledge bases $T_1 = \{p\}$, $T_2 = \{p\}$, and $T_3 = \{\neg p\}$ are 
combined, then the result would be $\{p\}$, since two vote for $p$, whereas only 
one votes against it. 

One of the most general merging functions based on majority is defined 
in [34]. A function Merge is applied to weighted knowledge bases. Let 
$wt : \{T_1, T_2, \ldots, T_k\} \to \mathbb{R}^+$ be a weight function which 
assigns a positive real number to each component knowledge base, then a total 
pre-order $\preceq$ over the set of propositional interpretations is defined as: 

$w \preceq w'$ iff $\sum_{i=1}^{k} dist(w, T_i) \cdot wt(T_i) \le \sum_{i=1}^{k} dist(w', T_i) \cdot wt(T_i)$, 

where $dist$ is a function denoting the distance between a propositional 
interpretation and a knowledge base. When the propositional language is finite, 
the so-called Dalal distance (or Hamming distance) between two interpretations of 
the language is used [16]. It is defined as the number of atoms whose valuations 
differ in the two interpretations. Let $dist(w, w')$ denote the Dalal distance 
between two interpretations $w$ and $w'$, then the distance from $w$ to a theory 
$T$, denoted by $dist(w, T)$, is defined as: 

$dist(w, T) = \min\{dist(w, w') \mid w' \models T\}$. 

The merged result $Merge(T_1, T_2, \ldots, T_k, wt)$ is defined as: 

$\{\varphi \mid \forall w \in \min(\Omega, \preceq),\ w \models \varphi\}$, 

where $\Omega$ is the set of all propositional interpretations and $\preceq$ is 
the pre-order defined above. 
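
The weighted merging function can be tried out on small propositional languages. The sketch below is only an illustration under stated assumptions: knowledge bases are represented as Python predicates over interpretations, every base is assumed to be consistent, and instead of the set of entailed formulas the function returns the minimal interpretations (the models of the merged result). The names `interpretations`, `dalal`, `dist`, and `merge` are hypothetical.

```python
from itertools import product


def interpretations(atoms):
    """All truth assignments over the given atoms, as dictionaries."""
    return [dict(zip(atoms, bits))
            for bits in product([False, True], repeat=len(atoms))]


def dalal(w1, w2):
    """Dalal (Hamming) distance: number of atoms valued differently."""
    return sum(w1[a] != w2[a] for a in w1)


def dist(w, models):
    """Distance from interpretation w to a theory, given the theory's models."""
    return min(dalal(w, m) for m in models)


def merge(atoms, bases, weights):
    """Return the interpretations minimising the weighted sum of distances."""
    omega = interpretations(atoms)
    model_sets = [[w for w in omega if base(w)] for base in bases]

    def score(w):
        return sum(dist(w, ms) * wt for ms, wt in zip(model_sets, weights))

    best = min(score(w) for w in omega)
    return [w for w in omega if score(w) == best]


# The example from the text: T1 = {p}, T2 = {p}, T3 = {not p}.
bases = [lambda w: w["p"], lambda w: w["p"], lambda w: not w["p"]]
equal_weights = merge(["p"], bases, [1, 1, 1])
third_source_trusted = merge(["p"], bases, [1, 1, 3])
```

With equal weights the majority wins and p holds in the only minimal interpretation; giving the third base a weight of 3 reverses the outcome, which shows how the weight function encodes the relative reliability of the sources.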

This kind of weighted merging operator can be incorporated into epistemic 
logic in the following way. Syntactically, a new class of modal operators 
$[M(G, wt)]$ for any nonempty $G \subseteq \{1, 2, \ldots, n\}$ and weight 
function $wt : Ag \to \mathbb{R}^+$ is added to our logic language. Then the 
semantics for the new modal operators is defined by extending a possible world 
model to $(W, (\mathcal{R}_i)_{1 \le i \le n}, V, \mu)$, where 
$(W, (\mathcal{R}_i)_{1 \le i \le n}, V)$ is an $\mathcal{L}$ model, whereas 
$\mu : W \times W \to \mathbb{R}^+ \cup \{0\}$ is a distance metric function 
between possible worlds satisfying $\mu(w, w) = 0$ and $\mu(w, w') = \mu(w', w)$. 

The distance metric between possible worlds is defined as in the semantics 
of conditional logic [37,40]. The distance from a possible world $w$ to the belief 
state of an agent $i$ in the possible world $u$ is defined by: 

$dist_u(w, i) = \inf\{\mu(w, w') \mid (u, w') \in \mathcal{R}_i\}$. 

Then a total pre-order $\preceq^u_{(G,wt)}$ over possible worlds is defined for 
each possible world $u$ and modal operator $[M(G, wt)]$: 

$w \preceq^u_{(G,wt)} w'$ iff $\sum_{i \in G} dist_u(w, i) \cdot wt(i) \le \sum_{i \in G} dist_u(w', i) \cdot wt(i)$. 

The most straightforward definition for the satisfaction of the wff 
$[M(G, wt)]\varphi$ is: 

$u \models [M(G, wt)]\varphi$ iff for all $w \in \min(W, \preceq^u_{(G,wt)})$, $w \models \varphi$. 

However, since for infinite $W$, the set $\min(W, \preceq^u_{(G,wt)})$ may be 
empty, the definition may result in $u \models [M(G, wt)]\bot$ in some cases. 
Alternatively, since $\preceq^u_{(G,wt)}$ is a total pre-order, it is simply a 
system-of-spheres in the semantics of conditional logic [37], so we can define the 
satisfaction of the wff $[M(G, wt)]\varphi$ by 

$u \models [M(G, wt)]\varphi$ iff there exists $w_0$ such that for all $w \preceq^u_{(G,wt)} w_0$, $w \models \varphi$. 
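
On a finite set of worlds the two truth conditions coincide: taking $w_0$ to be any minimal world makes the sphere-based condition equivalent to quantifying over the minimal worlds. The sketch below evaluates the operator under that finiteness assumption. Worlds are encoded as bit tuples and the metric is taken to be the Hamming distance; both choices, as well as the names `hamming`, `dist_u`, and `majority_box`, are illustrative assumptions.

```python
from itertools import product


def hamming(w, v):
    """An example distance metric between worlds encoded as bit tuples."""
    return sum(a != b for a, b in zip(w, v))


def dist_u(mu, R_i, u, w):
    """Distance from world w to the belief state of an agent at world u."""
    return min(mu(w, v) for (x, v) in R_i if x == u)


def majority_box(W, R, mu, u, G, wt, truth_set):
    """u |= [M(G, wt)]phi for finite W, where truth_set is the set of phi-worlds."""
    score = {w: sum(dist_u(mu, R[i], u, w) * wt[i] for i in G) for w in W}
    best = min(score.values())
    return all(w in truth_set for w in W if score[w] == best)


# Worlds are valuations (p, q). Each agent's belief state is a single world,
# the same at every world, so every relation is serial.
W = list(product([0, 1], repeat=2))
beliefs = {1: (1, 1), 2: (1, 0), 3: (0, 0)}
R = {i: {(x, target) for x in W} for i, target in beliefs.items()}
p_worlds = {w for w in W if w[0] == 1}
q_worlds = {w for w in W if w[1] == 1}
```

With equal weights the merged belief of agents 1, 2, and 3 at world (0, 0) includes p, since two of the three agents believe it; raising the weight of agent 3 to 5 makes that agent's belief state the unique minimum, and p is no longer believed.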

Note that the function wt is used only for encoding the reliability of agents. 
It is tempting to propagate the weights into a group of agents so that we have 
a weight wt{G) for each group G. This weight may be useful in the belief fusion 
of two groups of agents. However, we do not really need this because if we want 
to merge the beliefs of two groups $G_1$ and $G_2$, we can simply merge the beliefs 
of agents in $G_1 \cup G_2$. 
3 Arbitration 



The notion of distance measure between possible worlds is also used in arbitra- 
tion, another type of merging operator [32,38,39]. 

A semantic characterization for arbitration is given in [32]. A knowledge 
base in [32] is identified with a set of propositional models, thus the semantic 
characterization for this kind of arbitration is given by assigning to each subset 
of models $A$ a binary relation $\le_A$ over the set of model sets satisfying the 
following conditions (the subscript is omitted when it means all binary relations 
of the form $\le_A$): 

1. transitivity: if $A \le B$ and $B \le C$ then $A \le C$, 
2. if $A \subseteq B$ then $B \le A$, 
3. $A \le A \cup B$ or $B \le A \cup B$, 
4. $B \le_A C$ for every $C$ iff $A \cap B \ne \emptyset$, 
5. $A \le_{C \cup D} B$ iff ($C \le_{A \cup B} D$ and $A \le_C B$) or 
   ($D \le_{A \cup B} C$ and $A \le_D B$). 

By slightly abusing the notation, $\le_A$ may also denote binary relations between 
models in the sense that $w \le_A w'$ iff $\{w\} \le_A \{w'\}$. The arbitration 
between two sets of models $A$ and $B$ is then defined as: 

$A \triangle B = \min(A, \le_B) \cup \min(B, \le_A)$. (1) 
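
Equation (1) can be illustrated with a concrete model-level ordering. The sketch below assumes, purely for illustration, that $w \le_A w'$ holds when $w$ is at least as close to $A$ as $w'$ in Hamming distance; this particular ordering is not taken from [32] and has not been checked against the five conditions above. The names `hamming`, `dist`, `closest`, and `arbitrate` are hypothetical, and both arguments are assumed to be nonempty.

```python
def hamming(w, v):
    """Number of positions at which two bit-tuple models differ."""
    return sum(a != b for a, b in zip(w, v))


def dist(w, A):
    """Distance from model w to the nearest model in the set A."""
    return min(hamming(w, v) for v in A)


def closest(X, Y):
    """min(X, <=_Y) under the assumed ordering: members of X nearest to Y."""
    best = min(dist(w, Y) for w in X)
    return {w for w in X if dist(w, Y) == best}


def arbitrate(A, B):
    """Arbitration of A and B as in equation (1): min(A, <=_B) | min(B, <=_A)."""
    return closest(A, B) | closest(B, A)


A = {(0, 0), (0, 1)}
B = {(1, 1)}
result = arbitrate(A, B)
```

The result keeps the model of A nearest to B together with the model of B nearest to A, and swapping the arguments gives the same set, matching the commutativity noted for arbitration. When the two sets overlap, this ordering returns exactly their intersection.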

To incorporate the arbitration operator of [32] into epistemic logic, we first 
note that according to (1), the arbitration is commutative but not necessarily 
associative. Thus, the arbitration operator should be a binary operator between 
two agents. We can add a class of modal operators for arbitration into our logic 
just as in the case of majority merging. However, to be more expressive, we will 
also consider the interaction between arbitration and other epistemic operators, 
so we define the set of arbitration expressions over $Ag$ recursively as the 
smallest set containing $Ag$ and closed under the binary operators $+$, $\cdot$, 
and $\triangle$. Here $+$ and $\cdot$ correspond respectively to the distributed 
belief and the so-called “everybody knows” operators in multi-agent epistemic 
logic [20]. Then the operator $[G]$ in 
epistemic logic can be replaced with a new class of modal operators [a] where a 
is an arbitration expression. 

For the semantics, a model is extended to 
$(W, (\mathcal{R}_i)_{1 \le i \le n}, V, \le)$, where $\le$ is a function 
assigning to each subset of possible worlds $A$ a binary relation 
$\le_A \subseteq 2^W \times 2^W$ satisfying the above-mentioned five conditions. 
Note that the first two conditions imply that $\le_A$ is a pre-order over $2^W$. 
Then for each arbitration expression, we can define the binary relations 
$\mathcal{R}_{a \triangle b}$, $\mathcal{R}_{a \cdot b}$ and $\mathcal{R}_{a+b}$ 
over $W$ recursively by: 

$\mathcal{R}_{a \triangle b}(w) = \min(\mathcal{R}_a(w), \le_{\mathcal{R}_b(w)}) \cup \min(\mathcal{R}_b(w), \le_{\mathcal{R}_a(w)})$, 
$\mathcal{R}_{a+b} = \mathcal{R}_a \cap \mathcal{R}_b$, 
$\mathcal{R}_{a \cdot b} = \mathcal{R}_a \cup \mathcal{R}_b$. 

Thus the satisfaction for the wff $[a]\varphi$ is defined as: 

$u \models [a]\varphi$ iff for all $w \in \mathcal{R}_a(u)$, $w \models \varphi$. 

Note that the original distributed belief operator $[G]$ is equivalent to 
$[i_1 + (i_2 + \cdots (i_{k-1} + i_k))]$ if $G = \{i_1, i_2, \ldots, i_k\}$. 
Furthermore, it has been shown that the only associative arbitration satisfying 
postulates 7 and 8 of [32] is $A \triangle B = A \cup B$, so if $\triangle$ is an 
associative arbitration satisfying those postulates, then $[a \triangle b]\varphi$ 
is reduced to $[a \cdot b]\varphi$, which is in turn equivalent to 
$[a]\varphi \wedge [b]\varphi$. 

By this kind of modal operators, the postulates 2-8 of [32] can be translated 
into the following axioms: 

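1. $[a \triangle b]\varphi \equiv [b \triangle a]\varphi$, 

2. $[a \triangle b]\varphi \supset [a + b]\varphi$, 

3. $\neg[a + b]\bot \supset ([a + b]\varphi \supset [a \triangle b]\varphi)$, 

4. $[a \triangle b]\bot \supset [a]\bot \wedge [b]\bot$, 

5. $([a \triangle (b \cdot c)]\varphi \equiv [a \triangle b]\varphi) \vee ([a \triangle (b \cdot c)]\varphi \equiv [a \triangle c]\varphi) \vee ([a \triangle (b \cdot c)]\varphi \equiv [(a \triangle b) \cdot (a \triangle c)]\varphi)$, 

6. $[a]\varphi \wedge [b]\varphi \supset [a \triangle b]\varphi$, 

7. $\neg[a]\bot \supset \neg[a + (a \triangle b)]\bot$. 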

However, since the set of possible worlds $W$ may be infinite in our logic, the 
minimal models in (3) may not exist, so the axioms 4 and 7 are not sound with 
respect to the semantics. To make them sound, we must add the following limit 
assumption [2] to the binary relations $\le_A$ for any $A \subseteq W$: 

for any nonempty $U \subseteq W$, $\min(U, \le_A)$ is nonempty. 

4 General Merging 

In [26], an axiomatic framework unifying the majority merging and arbitration 
operators is presented. A set of postulates common to majority and arbitration 
operators is first proposed to characterize the general merging operators and then 
additional postulates for differentiating them are considered respectively. In that 
framework, a knowledge base is also a finite set of propositional sentences. The 
general merging operator is defined as a mapping from a multi-set³ of knowledge 
bases, called a knowledge set, to a knowledge base. Therefore, the arbitration 
operator defined via this approach can merge more than two knowledge bases, 
whereas the definition of arbitration operator in [32] is limited to two knowledge 
bases. The merging operator is denoted by $\Delta$, so for each knowledge set $E$, 
$\Delta(E)$ is a knowledge base. Two equivalent semantic characterizations are 
also given for the merging operators. One is based on the so-called syncretic 
assignment. A syncretic assignment maps each knowledge set $E$ to a pre-order 
$\le_E$ over interpretations such that some conditions reflecting the postulated 
properties of the merging operators must be satisfied. Then $\Delta(E)$ is the 
knowledge base whose models are the minimal interpretations according to $\le_E$. 

³ A multi-set, also called a bag, is a collection of elements over some domain which 
allows multiple occurrences of elements. 

This logical framework is further extended to deal with integrity constraints 
in [27]. Let $E$ be a knowledge set and $\varphi$ be a propositional sentence 
denoting the integrity constraints, then the merging of knowledge bases in $E$ 
with integrity constraint $\varphi$, $\Delta_\varphi(E)$, is a knowledge base 
which implies $\varphi$. The models of $\Delta_\varphi(E)$ are characterized by 
$\min(Mod(\varphi), \le_E)$, i.e., the minimal models of $\varphi$ with respect to 
the ordering $\le_E$. $\Delta_\varphi(E)$ is called an IC merging operator. 
According to the semantics, it is obvious that $\Delta(E)$ is a special case of 
the IC merging operator $\Delta_\top(E)$. It is also shown that when $E$ contains 
exactly one knowledge base, the operator is reduced to the AGM revision operator 
proposed in [1]. Therefore, IC merging is general enough to cover majority 
merging, arbitration, and the AGM revision operator. 

To incorporate IC merging operators into epistemic logic, we will extend its 
syntax with the following formation rule: 

- if $\varphi$ and $\psi$ are wffs, then for any nonempty 
  $G \subseteq \{1, 2, \ldots, n\}$, $[\Delta_\varphi(G)]\psi$ is also a wff. 

For the convenience of naming, we will call a subset of possible worlds a 
belief state. Let $\mathcal{U} = \{U_1, U_2, \ldots, U_k\}$ denote a multi-set of 
belief states, then $\bigcap\mathcal{U} = U_1 \cap \cdots \cap U_k$. For the 
semantics, a possible world model is extended to 
$(W, (\mathcal{R}_i)_{1 \le i \le n}, V, \le)$, where $\le$ is an assignment 
mapping each multi-set of belief states $\mathcal{U}$ to a total pre-order 
$\le_{\mathcal{U}}$ over $W$ satisfying the following conditions: 

1. If $w, w' \in \bigcap\mathcal{U}$, then $w \le_{\mathcal{U}} w'$, 

2. If $w \in \bigcap\mathcal{U}$ and $w' \notin \bigcap\mathcal{U}$, then $w <_{\mathcal{U}} w'$, 

3. For any $w \in U_1$, there exists $w' \in U_2$ such that $w' \le_{\{U_1, U_2\}} w$, 
   where $U_1$ and $U_2$ are two belief states, 

4. If $w \le_{\mathcal{U}_1} w'$ and $w \le_{\mathcal{U}_2} w'$, then 
   $w \le_{\mathcal{U}_1 \sqcup \mathcal{U}_2} w'$, where $\sqcup$ denotes the union 
   of two multi-sets, 

5. If $w <_{\mathcal{U}_1} w'$ and $w \le_{\mathcal{U}_2} w'$, then 
   $w <_{\mathcal{U}_1 \sqcup \mathcal{U}_2} w'$. 

These conditions are model-theoretic correspondences of those for syncretic as- 
signments in [26,27]. Condition 1 says that possible worlds appearing in the 
belief states of all agents are equally plausible. Condition 2 asserts that a 
possible world appearing in the belief states of all agents is more plausible than 
those that do not. Condition 3 requires that all agents are treated fairly. Therefore, if 
agent 1 considers w possible, then w is not more plausible than all worlds in the 
belief state of agent 2. Conditions 4 and 5 essentially require that if two groups 
of agents agree on the ordering between w and w', then the united group of these 
two groups does not reverse the ordering. 

For a group of agents $G$ and a possible world $u$, let us define a total 
pre-order $\le^u_G$ over $W$ as follows: 

$w \le^u_G w'$ iff $w \le_{\{\mathcal{R}_i(u) \mid i \in G\}} w'$. 

The truth condition of $[\Delta_\varphi(G)]\psi$ is defined as that for 
conditional logic [10,9]. Formally, $u \models [\Delta_\varphi(G)]\psi$ iff 

(i) there are no possible worlds in $W$ satisfying $\varphi$, or 

(ii) there exists $w_0$ such that $w_0 \models \varphi$ and for any 
$w \le^u_G w_0$, $w \models \varphi \supset \psi$. 
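
Clauses (i) and (ii) are easy to evaluate when the set of worlds is finite. The sketch below assumes that the total pre-order for the group at the current world is already given as a numeric rank (lower means more plausible) and that the truth sets of the constraint and of the body have been computed beforehand; how the rank is obtained from the assignment is left open. The name `ic_box` and the example data are hypothetical.

```python
def ic_box(W, rank, phi, psi):
    """Evaluate [Delta_phi(G)]psi at a world, given rank for the group ordering.

    W    : set of possible worlds
    rank : dict mapping each world to a number (lower = more plausible)
    phi  : set of worlds satisfying the integrity constraint
    psi  : set of worlds satisfying the formula under the modality
    """
    phi_worlds = W & phi
    if not phi_worlds:                       # clause (i)
        return True
    return any(                              # clause (ii)
        all((w not in phi) or (w in psi) for w in W if rank[w] <= rank[w0])
        for w0 in phi_worlds
    )


W = {0, 1, 2, 3}
rank = {0: 0, 1: 1, 2: 1, 3: 2}
```

In the example, world 0 is the most plausible but violates the constraint {1, 2, 3}, so the most plausible constraint-satisfying worlds are 1 and 2; a formula true at both is accepted, while one true only at world 1 is not.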

Note that in IC merging, a knowledge set consists of a multi-set of objective 
sentences, whereas for the modal operator $[\Delta_\varphi(G)]$, $G$ is a set of 
agents whose beliefs may contain subjective sentences or beliefs of other agents. 
Also, an integrity constraint in [27] must be an objective sentence, whereas 
$\varphi$ may be an arbitrarily complex wff of our extended language. 
Furthermore, instead of selecting minimal models of $\varphi$, since the set of 
possible worlds may be infinite in our case, we adopt the system-of-spheres 
semantics as in Sect. 2 for the epistemic operator $[\Delta_\varphi(G)]$. 

5 Belief Change 

Unlike knowledge merging, where the component knowledge bases are equally 
important, belief change is a kind of asymmetric operator, where new information 
always outweighs the old. The main belief change operators are belief revision 
and update. They are characterized by different postulates [1,23,24]. In [23], a 
uniform model-theoretic framework is provided for the semantic characterization 
of the revision and update operators. In that context, a knowledge base is a 
finite set of propositional sentences, so it can also be represented by a single 
sentence(i.e., the conjunction of all sentences in the knowledge base). 

For the revision operator, it is assumed that there is a total pre-order 
$\le_\psi$ over the propositional interpretations for each knowledge base $\psi$. 
The revision operators satisfying the AGM postulates in [1] are exactly those that 
select from the models of the new information $\varphi$ the minimal ones with 
respect to the ordering $\le_\psi$. More precisely, let $\psi$ be a knowledge base 
and $\varphi$ denote the new information, then the result of revising $\psi$ by 
$\varphi$, denoted by $\psi \circ \varphi$, will have the set of models 

$Mod(\psi \circ \varphi) = \min(Mod(\varphi), \le_\psi)$. 

As for the update operator, assume for each propositional interpretation $w$, 
there exists some partial pre-order $\le_w$ over the interpretations for closeness 
to $w$, then update operators select for each model $w$ in $Mod(\psi)$ the set of 
models from $Mod(\varphi)$ that are closest to $w$. The updated theory is 
characterized by the union of all such models. That is, 

$Mod(\psi \diamond \varphi) = \bigcup_{w \in Mod(\psi)} \min(Mod(\varphi), \le_w)$, 

where $\psi \diamond \varphi$ is the result of updating the knowledge base $\psi$ 
by $\varphi$. 
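
The difference between the two operators shows up as soon as the old knowledge base has more than one model. The sketch below instantiates both orderings with the Hamming distance, which is one common choice and an assumption here rather than something the characterization prescribes; models are bit tuples and the names `hamming`, `revise`, and `update` are illustrative.

```python
def hamming(w, v):
    """Number of atoms on which two bit-tuple interpretations differ."""
    return sum(a != b for a, b in zip(w, v))


def revise(mod_psi, mod_phi):
    """Models of phi that are globally closest to the set of models of psi."""
    def d(w):
        return min(hamming(w, v) for v in mod_psi)

    best = min(d(w) for w in mod_phi)
    return {w for w in mod_phi if d(w) == best}


def update(mod_psi, mod_phi):
    """For each model of psi, keep the models of phi closest to it; take the union."""
    result = set()
    for v in mod_psi:
        best = min(hamming(w, v) for w in mod_phi)
        result |= {w for w in mod_phi if hamming(w, v) == best}
    return result


# Atoms (b, m). Old belief: exactly one of b, m holds. New information: b.
mod_psi = {(1, 0), (0, 1)}
mod_phi = {(1, 0), (1, 1)}
revised = revise(mod_psi, mod_phi)
updated = update(mod_psi, mod_phi)
```

Revision keeps only the model (1, 0), which is already compatible with the old belief, whereas update moves each old model separately and therefore also keeps (1, 1), the nearest b-world to the old model (0, 1).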

Both belief revision and update may occur in the observation of new infor- 
mation p. For belief revision, it is assumed that the world is static, so if the 
new information is incompatible with the agent’s original beliefs, then the agent 
may have an incorrect belief about the world. Thus he will try to accommodate 
the new information by minimally changing his original beliefs. However, for the 
belief update, it is assumed that the observation may be due to dynamic changes 
of the outside world, so the agent’s belief may be out-of-date, though it may be 
totally correct for the original world. Thus the agent will assume the possible 
worlds are those resulting from the minimal change of the original world. In [11], 
a generalized update model is proposed which combines aspects of both revision 
and update. It is shown that a belief update model will be inadequate without 
modelling the dynamic aspect (i.e. the events causing the update) at the same 
time. Since the dynamic change of the external worlds does not play a role in 
the belief fusion process, we will not model belief update in our logic. Therefore, 
in what follows, we will concentrate on the belief revision operator. 

Let us now consider the possibility of incorporating the belief revision 
operator into epistemic logic. In addition to the original meaning of revising a 
knowledge base $\psi$ by new information $\varphi$, there is an alternative 
reading for the revision operator. That is, we can consider $\circ$ as a 
prioritized belief fusion operator that gives priority to its second argument 
[22]. In the context of knowledge base revision, these two interpretations are 
essentially equivalent. However, from the perspective of our logic in multi-agent 
systems, they may be quite different. Roughly speaking, $i \circ \varphi$ will 
denote the result of revising the beliefs of agent $i$ by new information 
$\varphi$, whereas $i \circ j$ is the result of merging the beliefs of agents $i$ 
and $j$ by giving priority to $j$. More formally, a revision expression will be 
defined inductively as follows: 

- If $1 \le i, j \le n$ and $\varphi$ is a wff, then $i \circ j$ and 
  $i \circ \varphi$ are revision expressions. 

- If $r$ is a revision expression, $1 \le i \le n$ and $\varphi$ is a wff, then 
  $r \circ i$ and $r \circ \varphi$ are revision expressions. 

The syntactic rule is extended to include the modal operators $[r]$ for any 
revision expression $r$, so $[r]\varphi$ would be a wff if $\varphi$ is. Note that 
a revision expression allows us to represent a revision sequence, which is 
directly related to iterated revision in [8,17]. 

To interpret the modal operator in our semantic framework, a possible world 
model is extended to $(W, (\mathcal{R}_i)_{1 \le i \le n}, V, \le)$, where $\le$ 
is an assignment mapping each belief state (i.e. subset of possible worlds) $U$ to 
a total pre-order $\le_U$ over $W$ such that (i) if $w, w' \in U$, then 
$w \le_U w'$ and (ii) if $w \in U$ and $w' \notin U$, then $w <_U w'$. Let 
$S \cdot U$ denote the sequence $(U_1, U_2, \ldots, U_k, U)$ if 
$S = (U_1, \ldots, U_k)$ is a sequence of belief states, then the assignment $\le$ 
is extended to sequences of belief states in the following way (we assume 
$\le_{(U)} = \le_U$): 

1. $w <_{S \cdot U} w'$ if $w \in U$ and $w' \notin U$, 

2. $w \le_{S \cdot U} w'$ iff $w \le_S w'$ when both $w, w' \in U$ or both $w, w' \notin U$. 
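
The two clauses describe a lexicographic comparison in which the most recent belief state has the highest priority. A minimal sketch follows, under the extra assumption that the base ordering $\le_U$ has just two levels (worlds in $U$ below all others, with worlds outside $U$ equally plausible); conditions (i) and (ii) allow finer orderings outside $U$. The names `plausibility_key` and `leq_seq` are hypothetical.

```python
def plausibility_key(seq, w):
    """Sort key for world w under the ordering induced by a sequence of belief states.

    The last belief state is compared first; False (membership) sorts before
    True (non-membership), so members of later states are more plausible.
    """
    return tuple(w not in U for U in reversed(seq))


def leq_seq(seq, w, v):
    """w <=_S v for the sequence S = seq, under the two-level base ordering."""
    return plausibility_key(seq, w) <= plausibility_key(seq, v)


# Sequence S = (U1, U2) over the worlds 0..3.
seq = [{0, 1}, {1, 2}]
ordered = sorted(range(4), key=lambda w: plausibility_key(seq, w))
```

World 1 belongs to both belief states and comes first; world 2 belongs only to the later state and therefore precedes world 0, which belongs only to the earlier one.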

For each wff $\varphi$, let the truth set of $\varphi$, denoted by $|\varphi|$, 
be defined as $\{w \in W \mid w \models \varphi\}$. For each possible world $u$, 
define a function mapping any agent $i$ and revision expression $r$ into a 
sequence of belief states $u(i)$ and $u(r)$ as follows: 

1. $u(i) = (\mathcal{R}_i(u))$, 

2. $u(r \circ i) = u(r) \cdot \mathcal{R}_i(u)$, 

3. $u(r \circ \varphi) = u(r) \cdot |\varphi|$. 
Then the truth condition for the wff $[r \circ \varphi]\psi$ is: 
$u \models [r \circ \varphi]\psi$ iff 

(i) there are no possible worlds in $W$ satisfying $\varphi$, or 

(ii) there exists $w_0 \in W$ such that $w_0 \models \varphi$ and for any 
$w \le_{u(r)} w_0$, $w \models \varphi \supset \psi$. 

Analogously, the truth condition for the wff $[r \circ i]\psi$ is 

$u \models [r \circ i]\psi$ iff there exists $w_0 \in \mathcal{R}_i(u)$ such that 
for any $w \le_{u(r)} w_0$, if $w \in \mathcal{R}_i(u)$, then $w \models \psi$. 

It can be seen that $[i \circ \varphi]\psi$ is equivalent to 
$[\Delta_\varphi(\{i\})]\psi$ in Sect. 4 according to the semantics. 

6 Concluding Remarks 

In preceding sections, we assume an agent’s belief states are represented as a 
subset of possible worlds, i.e. TZi{w) is the belief state of agent i in world w. 
However, some more fine-grained representations have also been proposed, such 
as total pre-orders over the set of possible worlds [8,17,28,41], ordinal conditional 
functions [11,43,44], possibility distributions [3,18,19], belief functions [42] and 
pedigreed belief states [22] . Further development of logical systems that incorpo- 
rate fusion operators based on more fine-grained representations of belief states 
should be a very interesting research direction. 

We mainly present the semantics of epistemic logics for information fusion in 
this paper. However, to do practical reasoning, we must develop proof methods 
for these logics. There have been some previous works on the development of 
axiomatic or Gentzen-style calculi for information fusion. For example, in [4,5, 
6,7], logics for information fusion based on possibility theory are proposed. The 
Hilbert-style or Gentzen-style proof systems of those logics are also presented. 
In particular, the logic PL® in [4] is an extension of QML in [29,30,31] with 
distributed belief operator, so the fusion operator in PL® is different from the 
merging operators used in this paper. The axiomatic system and theorem prover 
for a majority fusion logic MF have also been developed in [13,14]. The belief 
bases in MF are sets of literals, so it does not allow nested modalities as in our 
logics. In spite of these differences, the further development of proof theory for 
logics proposed in this paper could take these previous works as good starting 
points. 
Abstract. In previous papers, we have presented a logic-based framework for 
merging structured news reports [14,16,15]. Structured news reports are XML 
documents, where the text entries are restricted to individual words or simple phrases, 
such as names and domain-specific terminology, and numbers and units. We as- 
sume structured news reports do not require natural language processing. In this 
paper, we present propositional fusion rules as a way of implementing logic-based 
fusion for structured news reports. Fusion rules are a form of scripting language 
that define how structured news reports should be merged. The antecedent of a 
fusion rule is a call to investigate the information in the structured news reports 
and the background knowledge, and the consequent of a fusion rule is a formula 
specifying an action to be undertaken to form a merged report. It is expected that 
a set of fusion rules is defined for any given application. We give the syntax and 
mode of execution for fusion rules, and explain how the resulting actions give a 
merged report. 



1 Introduction 



Structured news reports are XML documents, where the text entries are restricted to 
individual words or simple phrases (such as names and domain-specific terminology), 
dates, numbers and units. We assume that structured news reports do not require 
natural language processing. In addition, each tag provides semantic information about 
the textentries, and a structured news report is intended to have some semantic 
coherence. To illustrate, news reports on corporate acquisitions can be represented as 
structured news reports using tags including buyer, seller, acquisition, value, 
and date. Structured news reports can be obtained from information extraction systems 
(e.g. [8]). 

In order to merge structured news reports, we need to take account of the contents. 
Different kinds of content need to be merged in different ways as illustrated in Example 1. 
There are many further examples we could consider, each with particular features that 
indicate how the merged report should be formed. 

Example 1. Consider the following two conflicting weather reports which are for the 
same day and same country. 



T.D. Nielsen and N.L. Zhang (Eds.): ECSQARU 2003, LNAI 2711, pp. 502-514, 2003. 
(c) Springer- Verlag Berlin Heidelberg 2003 
<report>
  <source>TV1</source>
  <date>19/3/02</date>
  <country>UK</country>
  <today>showers</today>
  <windspeed>10 kph</windspeed>
  <tomorrow>sun</tomorrow>
  <regionalreport>
    <region>South East</region>
    <maxtemp>20C</maxtemp>
  </regionalreport>
</report>



<report>
  <source>TV3</source>
  <date>19 March 2002</date>
  <country>UK</country>
  <today>inclement</today>
  <windspeed>15 kph</windspeed>
  <tomorrow>rain</tomorrow>
  <regionalreport>
    <region>North West</region>
    <maxtemp>18C</maxtemp>
  </regionalreport>
</report>



We can merge them so the source is TV1 and TV3, and the weather for today is 
showers and inclement, and the weather for tomorrow is sun or rain. Also we 
may wish to take each subtree that is rooted at regionalreport in the input, and put 
them both in the merged report. 

<report>
  <source>TV1 and TV3</source>
  <date>19.03.02</date>
  <country>UK</country>
  <today>showers and inclement</today>
  <windspeed>10-15 kph</windspeed>
  <tomorrow>sun or rain</tomorrow>
  <regionalreport>
    <region>South East</region>
    <maxtemp>20C</maxtemp>
  </regionalreport>
  <regionalreport>
    <region>North West</region>
    <maxtemp>18C</maxtemp>
  </regionalreport>
</report>

An alternative way of merging these reports may be possible if we have a preference for 
one source over the other. Suppose we have a preference for TV3 in the case of conflict, 
then we may prefer the textentry rain for the tag tomorrow. 



In our approach to merging structured news reports, we draw on domain knowledge 
to help produce merged reports. The approach is based on fusion rules defined in a logical 
meta-language. These rules are of the form $\alpha \Rightarrow \beta$, expressing that if 
$\alpha$ holds, then $\beta$ is made to hold. So we consider $\alpha$ as a condition to 
check the information in the structured reports and in the background information, and we 
consider $\beta$ as an action to undertake to construct the merged report. 

To merge a set of structured news reports, we start with the background knowledge 
and the information in the news reports to be merged, and attempt to apply all the fusion 
rules to this information. The application of the fusion rules is then a monotonic process 
that builds up a set of actions that define how the merged structured news report to be 
output should be constructed. 

2 Structured News Reports 

We use XML to represent structured news reports. So each structured news report is an 
XML document, but not vice versa, as defined below. This restriction means that we can 
easily represent each structured news report by a ground term in classical logic. 

Definition 1. Structured news report: If $\phi$ is a tagname (i.e. an element name), and 
$\psi$ is a textentry, then $\langle\phi\rangle\psi\langle/\phi\rangle$ is a structured 
news report. If $\phi$ is a tagname and $\sigma_1, \ldots, \sigma_n$ are structured news 
reports, then $\langle\phi\rangle\sigma_1 \ldots \sigma_n\langle/\phi\rangle$ is a 
structured news report. 

Clearly each structured news report is isomorphic to a tree with the non-leaf nodes 
being the tagnames and the leaf nodes being the textentries. 

Definition 2. Tree: If $\langle\phi\rangle\psi\langle/\phi\rangle$ is a structured news 
report, then there is an isomorphic tree that has the root $\phi$ and a child $\psi$ where 
$\psi$ is a leaf. If $\langle\phi\rangle\sigma_1 \ldots \sigma_n\langle/\phi\rangle$ is a 
structured news report, then there is an isomorphic tree where (1) the root is $\phi$ and 
(2) by recursion there is an isomorphic tree $\rho_i$ for each 
$\sigma_i \in \{\sigma_1, \ldots, \sigma_n\}$ where the root of $\rho_i$ is a child 
of $\phi$. 

This isomorphism allows us to give a definition for a branch of a structured news 
report. 

Definition 3. Branch: Let $\sigma$ be a structured news report and let $\rho$ be a tree 
that is isomorphic to $\sigma$. A sequence of tagnames $\phi_1/../\phi_n$ is a branch of 
$\rho$ iff (1) $\phi_1$ is the root of $\rho$ and (2) for each $i$, if $1 \le i < n$, then 
$\phi_i$ is the parent of $\phi_{i+1}$. Note, the child of $\phi_n$ is not necessarily a 
leaf node. By extension, $\phi_1/../\phi_n$ is a branch of $\sigma$ iff 
$\phi_1/../\phi_n$ is a branch of $\rho$. 

When we refer to a subtree (of a structured news report), we mean a subtree formed 
from the tree representation of the structured news report, where the root of the subtree 
is a tagname and the leaves are textentries. We formalize this as follows. 

Definition 4. Subtree: Let $\sigma$ be a structured news report and let $\rho$ be a tree that is
isomorphic to $\sigma$. A tree $\rho'$ is a subtree of $\rho$ iff (1) the set of nodes in $\rho'$ is a subset of the
set of nodes in $\rho$, and (2) for each node $\phi_i$ in $\rho'$, if $\phi_i$ is the parent of $\phi_j$ in $\rho$, then $\phi_j$ is
in $\rho'$ and $\phi_i$ is the parent of $\phi_j$ in $\rho'$. By extension, if $\sigma'$ is a structured news report, and
$\rho'$ is isomorphic to $\sigma'$, then we describe $\sigma'$ as a subtree of $\sigma$.

Each structured news report is also isomorphic with a ground term (of classical logic) 
where each tagname is a function symbol and each textentry is a constant symbol. 

Definition 5. News term: Each structured news report is isomorphic with a ground
term (of classical logic) called a news term. This isomorphism is defined inductively as
follows: (1) If $\langle\phi\rangle\psi\langle/\phi\rangle$ is a structured news report, where $\psi$ is a textentry, then $\phi(\psi)$ is
a news term that is isomorphic with $\langle\phi\rangle\psi\langle/\phi\rangle$; and (2) If $\langle\phi\rangle\psi_1 .. \psi_n\langle/\phi\rangle$ is a structured
news report, and $\psi_1'$ is a news term that is isomorphic with $\psi_1$, .., and $\psi_n'$ is a news
term that is isomorphic with $\psi_n$, then $\phi(\psi_1', .., \psi_n')$ is a news term that is isomorphic
with $\langle\phi\rangle\psi_1 .. \psi_n\langle/\phi\rangle$.

Via this isomorphic relationship, we can refer to a branch of a news term by using 
the branch of the isomorphic structured news report, and we can refer to a subtree of a 
news term by using the subtree of the isomorphic structured news report. 

Definition 6. Let $\sigma$ be a structured news report and let $\pi$ be a news term that is
isomorphic to $\sigma$. By extending Definition 3, $\phi_1/../\phi_n$ is a branch of $\pi$ iff $\phi_1/../\phi_n$ is a
branch of $\sigma$. Let $\sigma'$ be a structured news report and let $\pi'$ be a news term such that $\pi'$ is
isomorphic to $\sigma'$. By extending Definition 4, $\pi'$ is a subtree of $\pi$ iff $\sigma'$ is a subtree of $\sigma$.

We now define two functions that allow us to obtain subtrees and textentries from 
news terms. 

Definition 7. Let $\pi$ be a news term, let $\pi'$ be a subtree of $\pi$, and let $\psi$ be a textentry. If
$\phi_1/../\phi_n$ is a branch of $\pi$, and the root of $\pi'$ is $\phi_n$, let Subtree$(\phi_1/../\phi_n, \pi) = \pi'$,
otherwise let Subtree$(\phi_1/../\phi_n, \pi)$ = null. If $\phi_1/../\phi_n$ is a branch of $\pi$, and $\psi$ is the child
of $\phi_n$, let Textentry$(\phi_1/../\phi_n, \pi) = \psi$, otherwise Textentry$(\phi_1/../\phi_n, \pi)$ = null.



Example 2. Consider the following structured news report. 

<auctionreport>
<buyer><firstname>John</firstname><surname>Smith</surname></buyer>
<property>Lot37</property>
</auctionreport>

This can be represented by the following news term:

auctionreport(buyer(firstname(John), surname(Smith)), property(Lot37))

In this news term, auctionreport/buyer/firstname is a branch. If the news term is
denoted by $\pi$, we have

Subtree(auctionreport/buyer, $\pi$) = buyer(firstname(John), surname(Smith))
Subtree(auctionreport/buyer/firstname, $\pi$) = firstname(John)
Textentry(auctionreport/buyer/firstname, $\pi$) = John
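
The mapping from XML to news terms, and the two access functions of Definition 7, can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the nested-tuple encoding of news terms, the function names `to_news_term`, `subtree` and `textentry`, and the choice to follow the first child when sibling tagnames repeat are all assumptions made here.

```python
import xml.etree.ElementTree as ET

def to_news_term(elem):
    """Map an XML element to a nested tuple: (tagname, child, ...)."""
    children = list(elem)
    if not children:
        # Base case of Definition 5: a tagname applied to a textentry.
        return (elem.tag, (elem.text or "").strip())
    # Inductive case: a tagname applied to the news terms of its children.
    return (elem.tag,) + tuple(to_news_term(c) for c in children)

def subtree(branch, term):
    """Return the subterm whose root ends the branch, or None (the paper's null)."""
    tags = branch.split("/")
    if term[0] != tags[0]:
        return None
    for tag in tags[1:]:
        matches = [c for c in term[1:] if isinstance(c, tuple) and c[0] == tag]
        if not matches:
            return None
        term = matches[0]  # assumption: sibling tagnames are unique
    return term

def textentry(branch, term):
    """Return the textentry that is the child of the last tagname, or None."""
    sub = subtree(branch, term)
    if sub is not None and len(sub) == 2 and isinstance(sub[1], str):
        return sub[1]
    return None

REPORT = ("<auctionreport>"
          "<buyer><firstname>John</firstname><surname>Smith</surname></buyer>"
          "<property>Lot37</property>"
          "</auctionreport>")
pi = to_news_term(ET.fromstring(REPORT))
print(subtree("auctionreport/buyer", pi))
print(textentry("auctionreport/buyer/firstname", pi))
```

Using `None` for null keeps the "otherwise" cases of Definition 7 explicit: a path that is not a branch, or a branch whose last tagname has no textentry child, yields no value.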



Definition 8. A skeleton is of the form $\phi(\psi_1, .., \psi_n)$ where $\phi$ is a tagname and $\psi_1, .., \psi_n$
are skeletons. If $\psi_i$ is a tagname, then it is a skeleton. A skeleton $\phi(\psi_1, .., \psi_n)$ can be
regarded as a tree.

A skeleton is equivalent to a structured news report without text entries. It is the
underlying structure without the content.

3 Fusion Rules

In this paper, we restrict consideration to merging sets of structured news reports of a 
fixed cardinality. So a set of fusion rules is specified for an application to take exactly n 
structured news reports as input to produce a merged report as output. 

Definition 9. We assume an arbitrary naming of the input reports with names from a
set of report names $\{in_1, .., in_n\}$. The set of report names is denoted $\mathcal{N}$.



Definition 10. Let $\phi_1/../\phi_n$ be a branch, and let $\mu \in \mathcal{N}$ be a report name. A subtree
variable is denoted $\mu//\phi_1/../\phi_n$, and a textentry variable is denoted $\mu//\phi_1/../\phi_n\#$.

A schema variable is either a subtree variable or a textentry variable. Let the set of
schema variables be denoted $\mathcal{S}$.

In the following definition, we augment a definition for a classical logic language 
(where function symbols are not nested) with notation for schema variables which are 
just placeholders to be instantiated with news terms before logical reasoning. 

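Definition 11. Let $\mathcal{C}$ be a set of constant symbols, let $\mathcal{S}$ be a set of schema variables,
and $\mathcal{F}$ be a set of function symbols. The set of ground terms $\mathcal{G}$ is $\mathcal{C} \cup \{f(c_1, .., c_k) \mid f \in \mathcal{F} \text{ and } c_1, .., c_k \in \mathcal{C}\}$.
The set of terms $\mathcal{T}$ is $\mathcal{C} \cup \{f(d_1, .., d_k) \mid f \in \mathcal{F} \text{ and } d_1, .., d_k \in \mathcal{C} \cup \mathcal{S}\}$.
Let $\mathcal{P}$ be a set of predicate symbols. The set of atoms $\mathcal{A}$ is $\{p(t_1, .., t_k) \mid p \in \mathcal{P} \text{ and } t_1, .., t_k \in \mathcal{T}\}$.
The set of literals $\mathcal{L}$ is $\mathcal{A} \cup \{\neg\gamma \mid \gamma \in \mathcal{A}\}$. The set of ground
atoms $\mathcal{B}$ is $\{p(g_1, .., g_k) \mid p \in \mathcal{P} \text{ and } g_1, .., g_k \in \mathcal{G}\}$. The set of ground literals $\mathcal{M}$ is
$\mathcal{B} \cup \{\neg\gamma \mid \gamma \in \mathcal{B}\}$.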



Definition 12. Let $\phi(\psi_1, .., \psi_n)$ be a news term. The subterms of this news term are
given by the function Subterms as follows.

Subterms$(\phi(\psi_1, .., \psi_n)) = \{\phi(\psi_1, .., \psi_n)\} \cup$ Subterms$(\psi_1) \cup .. \cup$ Subterms$(\psi_n)$

For a set of news terms $\Phi$, let Subterms$(\Phi) = \bigcup_{\phi(\psi_1, .., \psi_n) \in \Phi}$ Subterms$(\phi(\psi_1, .., \psi_n))$.
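
The recursion in Definition 12 translates directly into code. The sketch below assumes news terms are encoded as nested tuples `(tagname, child, ...)` with textentries as plain strings, which is an encoding chosen here for illustration. It also assumes that a textentry counts as a subterm of itself; the definition leaves that base case implicit, but Definition 14 needs textentries to be members of Subterms.

```python
def subterms(term):
    """Subterms(phi(psi_1, .., psi_n)) = {phi(..)} united with Subterms(psi_i) for each i."""
    result = {term}
    if isinstance(term, tuple):
        for child in term[1:]:
            result |= subterms(child)
    return result

def subterms_of_set(terms):
    """Subterms lifted to a set of news terms."""
    collected = set()
    for term in terms:
        collected |= subterms(term)
    return collected

buyer = ("buyer", ("firstname", "John"), ("surname", "Smith"))
print(sorted(map(str, subterms(buyer))))
```

Tuples are hashable, so each subterm can be stored in an ordinary Python set without further encoding.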

We assume that for all possible news terms $\pi$, we have Subterms$(\pi) \subseteq \mathcal{G}$. We
also assume that if $\phi_1/../\phi_n$ is a branch, then it is a constant symbol, called a branch
constant, and it is in $\mathcal{G}$. Similarly, if $\phi(\psi_1, .., \psi_n)$ is a skeleton, then it is a constant
symbol, called a skeleton constant, and it is in $\mathcal{G}$.

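Definition 13. A propositional fusion rule is of the following form where $\alpha_1, .., \alpha_n \in \mathcal{L}$
and $\beta \in \mathcal{A}$.

$\alpha_1 \wedge .. \wedge \alpha_n \Rightarrow \beta$

We call $\alpha_1, .., \alpha_n$ the condition literals and $\beta$ the action atom.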



We regard a fusion rule that incorporates schema variables, as a scheme for one 
or more classical propositional formulae. These propositional formulae are obtained 
by grounding schema variables as we explain below. We discuss condition literals in
Sect. 3.1 and action atoms in Sect. 3.2.

Example 3. The following is a propositional fusion rule for Example 1. 

¬Synonymous(in1//report/today#, in2//report/today#)
∧ Coherent(in1//report/today#, in2//report/today#)
⇒ AddText(Conjunction(in1//report/today#, in2//report/today#), report/today)



Definition 14. Let $\Phi$ be a set of structured news reports to be merged. Let $A$ be an
assignment (bijection) from the report names $\mathcal{N}$ to $\Phi$. $\tau$ is a valid grounding for a subtree
variable $\mu//\phi_1/../\phi_k$ iff $\tau \in$ Subterms$(A(\mu))$ and Subtree$(\phi_1/../\phi_k, A(\mu)) = \tau$. $\tau$
is a valid grounding for a textentry variable $\mu//\phi_1/../\phi_k\#$ iff $\tau \in$ Subterms$(A(\mu))$
and Textentry$(\phi_1/../\phi_k, A(\mu)) = \tau$.



Example 4. Consider the rule in Example 3. For in1//report/today#, the valid
grounding is Textentry(report/today, $A$(in1)). This is evaluated to showers if we
let $A$(in1) refer to the top left structured news report in Example 1. Similarly, the valid
grounding for variable in2//report/today# is Textentry(report/today, $A$(in2)),
which is evaluated to inclement if we let $A$(in2) refer to the top right structured news
report in Example 1.



Definition 15. A ground fusion rule is a propositional fusion rule with every schema
variable replaced by a valid grounding. Let Ground$(\delta, \Phi)$ be the set of all ground fusion
rules formed from the propositional fusion rule $\delta$ where each schema variable in $\delta$ is
systematically replaced by a valid grounding from Subterms$(\Phi)$.



Example 5. The ground fusion rule obtained with the fusion rule given in Example 3 
with the news reports in Example 1 is the following: 

¬Synonymous(showers, inclement)
∧ Coherent(showers, inclement)
⇒ AddText(Conjunction(showers, inclement), report/today)



Proposition 1. If $\gamma$ is a ground fusion rule, then $\gamma$ is a formula of propositional classical
logic.
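
The grounding step of Definitions 14 and 15, which turns the rule of Example 3 into the ground rule of Example 5, can be sketched as a textual substitution. Everything below is an assumption made for illustration: the rule is held as a plain string, news terms are nested tuples, the two input reports are cut down to the fields needed (only their today entries are taken from Example 4; the source entries are guessed from Example 9), and one fixed assignment is used rather than the full set Ground($\delta$, $\Phi$).

```python
import re

def lookup(branch, term):
    """Follow a branch such as 'report/today' through a nested-tuple news term."""
    tags = branch.split("/")
    if term[0] != tags[0]:
        return None
    for tag in tags[1:]:
        kids = [c for c in term[1:] if isinstance(c, tuple) and c[0] == tag]
        if not kids:
            return None
        term = kids[0]
    return term

def render(term):
    """Print a news term in functional notation, e.g. today(showers)."""
    if isinstance(term, str):
        return term
    return term[0] + "(" + ", ".join(render(c) for c in term[1:]) + ")"

# name//branch for a subtree variable, name//branch# for a textentry variable.
SCHEMA_VAR = re.compile(r"\b(in\d+)//([A-Za-z0-9_/]+)(#?)")

def ground(rule, assignment):
    """Replace every schema variable in the rule by its valid grounding."""
    def replace(match):
        name, branch, hash_mark = match.groups()
        sub = lookup(branch, assignment[name])
        if sub is None:
            raise ValueError("no valid grounding for " + match.group(0))
        if hash_mark:  # textentry variable
            if len(sub) == 2 and isinstance(sub[1], str):
                return sub[1]
            raise ValueError("no textentry for " + match.group(0))
        return render(sub)  # subtree variable
    return SCHEMA_VAR.sub(replace, rule)

# Hypothetical, simplified stand-ins for the two input reports of Example 1.
assignment = {
    "in1": ("report", ("source", "TV1"), ("today", "showers")),
    "in2": ("report", ("source", "TV3"), ("today", "inclement")),
}
rule = ("¬Synonymous(in1//report/today#, in2//report/today#) "
        "∧ Coherent(in1//report/today#, in2//report/today#) "
        "⇒ AddText(Conjunction(in1//report/today#, in2//report/today#), report/today)")
grounded = ground(rule, assignment)
print(grounded)
```

The branch constant report/today in the action atom is left untouched because it is not prefixed by a report name, which mirrors the distinction between schema variables and constants in Definition 11.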

This result means that we have a clear and simple characterization of propositional 
fusion rules as schema for classical propositional formulae such that once we have the 
grounded versions of them, reasoning with them is straightforward. As we discuss next, 
fusion rules provide a bridge between structured news reports and logical reasoning with 
background knowledge. 

3.1 Condition Literals

The condition literals in fusion rules relate the contents of structured news reports to 
the background knowledge. There are many possible condition literals that we could 
define that relate one or more features from one or more structured news reports to the 
background knowledge. To illustrate, these literals may include the following kinds: 
SameDate(T, T') where T and T' are news terms with equal date; SameSource(T, T')
where T and T' are news terms that refer to the same source; SameCity(T, T') where T 
and T' are news terms that refer to the same city; Synonymous(T, T') where T and T' are 
news terms that are synonyms; and Coherent(T, T') where T and T' are news terms that 
are coherent. 

Example 6. Some examples of condition literals (when ground) may include the fol- 
lowing. 



SameDate(date(14Nov01), date(14.11.01))
SameDate(date(day(14), month(11), year(01)), date(14.11.01))
SameCity(city(Mumbai), city(Bombay))
Coherent(snow, sleet)
Coherent(sun, sunny)
Coherent(showers, inclement)
¬Coherent(sun, rain)
¬Coherent(sun, snow)
¬Synonymous(showers, rain)

The condition literals are evaluated by querying background knowledge. In the simplest
case, the background knowledge may be just a set of ground atoms that hold.
However, we would expect the background knowledge to include classical quantified
formulae that can be handled using automated reasoning. In any case, the background
knowledge is defined by a knowledge engineer building a fusion system for an
application.



3.2 Action Atoms 

Action atoms specify the structure and content for a merged report. In a ground fusion
rule $\alpha' \Rightarrow \beta'$, if the ground literals in the antecedent $\alpha'$ hold, then the merged report
should meet the specification represented by the ground atom $\beta'$. We look at this in more
detail in the next section. Here we consider the syntax for action atoms.

Each action atom is a member of $\mathcal{A}$. These incorporate terms based on action functions
that take one or more news terms as arguments and return a news term. There
are many possibilities for action functions including the following where X and Y
are grounded with textentries: Interval(X, Y) returns an interval X - Y as a textentry;
Conjunction(X, Y) returns a textentry X and Y; and Disjunction(X, Y) returns a
textentry X or Y. We assume action functions are interpreted in the underlying
implementation and return the appropriate textentries for evaluating the action atom.

Example 7. The following ground functions, on the left of the = symbol, are rewritten to
the news terms on the right:

Interval(18C, 25C) = 18 - 25C
Conjunction(TV1, TV3) = TV1 and TV3
Disjunction(sun, rain) = sun or rain
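
The three action functions can be sketched as plain string functions that reproduce the rewrites of Example 7. The paper leaves their interpretation to the underlying implementation, so the details here are assumptions: in particular, `interval` drops the unit from the first argument only when both arguments are a number followed by the same unit, and otherwise simply joins the two textentries.

```python
import re

NUMBER_WITH_UNIT = re.compile(r"(\d+(?:\.\d+)?)(\D*)")

def interval(x, y):
    """Interval(X, Y): an interval textentry, sharing the unit when both agree."""
    mx = NUMBER_WITH_UNIT.fullmatch(x)
    my = NUMBER_WITH_UNIT.fullmatch(y)
    if mx and my and mx.group(2) == my.group(2):
        return f"{mx.group(1)} - {my.group(1)}{my.group(2)}"
    return f"{x} - {y}"

def conjunction(x, y):
    """Conjunction(X, Y): the textentry 'X and Y'."""
    return f"{x} and {y}"

def disjunction(x, y):
    """Disjunction(X, Y): the textentry 'X or Y'."""
    return f"{x} or {y}"

print(interval("18C", "25C"))
print(conjunction("TV1", "TV3"))
print(disjunction("sun", "rain"))
```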

We now define a basic set of action atoms. A number of further definitions for action 
atoms are possible. 

Definition 16. The action atoms are literals that include the following specifying how
the merged report should be constructed.

1. Initialize$(\phi(\psi_1, .., \psi_n))$ where $\phi(\psi_1, .., \psi_n)$ is a skeleton constant. The intended
action is to start the construction of the merged structured news report with the basic
structure being defined by $\phi(\psi_1, .., \psi_n)$. The root of the merged report is $\phi$.

2. AddText$(T, \phi_1/../\phi_n)$ where $T$ is a textentry, and $\phi_1/../\phi_n$ is a branch constant.
The intended action is to add the textentry $T$ as the child to the tagname $\phi_n$ in the
merged report on the branch $\phi_1/../\phi_n$.

3. AddTree$(T, \phi_1/../\phi_n)$ where $T$ is a news term, and $\phi_1/../\phi_n$ is a branch constant.
The intended action is to add $T$ to the merged report so that the tagname for the root
of $T$ has the parent $\phi_n$ on the branch $\phi_1/../\phi_n$.

The action atoms are specifications that are intended to be made to hold by producing 
a merged report that satisfies the specification. 

Example 8. Consider the action atom in the consequent of Example 3.

AddText(Conjunction(showers, inclement), report/today)

The term Conjunction(showers, inclement) is rewritten by the system to the term
showers and inclement, and so the action atom is now the following:

AddText(showers and inclement, report/today)

This specifies that the textentry should be showers and inclement in the merged report
for tagname today on the branch report/today, as obtained in Example 1.

4 Rule Execution 

In order to use a set of fusion rules, we need to be able to execute them with background 
knowledge and a set of structured news reports. 

Definition 17. A fusion call is a triple $(\Delta, \Gamma, \Phi)$ where $\Gamma$ is a set of fusion rules, $\Delta$ is
a background knowledgebase (a set of classical first-order formulae), and $\Phi$ is a set of
structured news reports.

To merge some reports, the fusion rules are ground with the structured news reports, 
and then a form of modus ponens is exhaustively applied together with the background 
knowledge. 

Definition 18. Let $(\Delta, \Gamma, \Phi)$ be a fusion call.

Actions$(\Delta, \Gamma, \Phi) = \{\beta' \mid \alpha' \Rightarrow \beta' \in$ Ground$(\delta, \Phi)$ and $\delta \in \Gamma$ and $\Delta \vdash \alpha'\}$

where $\vdash$ is the classical consequence relation.
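
For the simplest kind of background knowledge, Definition 18 reduces to a membership test. The sketch below assumes that $\Delta$ is a consistent set of ground literals, so that $\Delta \vdash \alpha'$ holds exactly when every condition literal of $\alpha'$ is a member of $\Delta$; a background knowledgebase with quantified formulae would need a theorem prover or a Prolog query instead. Literals are held as strings, and the literal ¬Synonymous(showers, inclement) is an assumed background fact that Example 6 does not list.

```python
def actions(ground_rules, background):
    """Collect the action atom of every ground rule whose condition literals all hold."""
    return {action
            for conditions, action in ground_rules
            if all(literal in background for literal in conditions)}

background = {
    "Coherent(showers, inclement)",
    "¬Synonymous(showers, inclement)",  # assumed for this sketch
    "¬Coherent(sun, rain)",
}
ground_rules = [
    (("¬Synonymous(showers, inclement)", "Coherent(showers, inclement)"),
     "AddText(Conjunction(showers, inclement), report/today)"),
    (("Coherent(sun, rain)",),
     "AddText(Conjunction(sun, rain), report/tomorrow)"),
]
result = actions(ground_rules, background)
print(result)
```

Only the first rule fires: its two condition literals are in the background set, while Coherent(sun, rain) is not.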



Example 9. A fusion call with an appropriate set of fusion rules and the pair of news 
reports given in the top of Example 1 together with appropriate background knowledge 
can give the following action atoms: 

Initialize(report(source, date, country, today, windspeed, tomorrow))
AddText(Conjunction(TV1, TV3), report/source)
AddText(19.03.02, report/date)
AddText(UK, report/country)
AddText(Conjunction(showers, inclement), report/today)
AddText(Interval(18Kph, 15Kph), report/windspeed)
AddText(Disjunction(sun, rain), report/tomorrow)
AddTree(regionalreport(region(SouthEast), maxtemp(20C)), report)
AddTree(regionalreport(region(NorthWest), maxtemp(18C)), report)

These action atoms specify the merged report given in the bottom of Example 1.
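
One way to turn such a set of action atoms into an XML document is sketched below, using Python's standard `xml.etree.ElementTree` module. It is not the authors' Java implementation: the tuple encoding of action atoms, skeletons and news terms, and the helper names, are assumptions, the action functions are taken to be already rewritten to textentries, and only a subset of the atoms of Example 9 is used.

```python
import xml.etree.ElementTree as ET

def add_skeleton(parent, skeleton):
    """Create empty elements for a skeleton given as a tagname or (tagname, kid, ...)."""
    if isinstance(skeleton, str):
        tag, kids = skeleton, ()
    else:
        tag, kids = skeleton[0], skeleton[1:]
    node = ET.Element(tag) if parent is None else ET.SubElement(parent, tag)
    for kid in kids:
        add_skeleton(node, kid)
    return node

def add_term(parent, term):
    """Append a news term (tagname, child, ...) below parent."""
    node = ET.SubElement(parent, term[0])
    for child in term[1:]:
        if isinstance(child, str):
            node.text = child
        else:
            add_term(node, child)

def find(root, branch):
    """Return the element at the end of a branch such as 'report/today'."""
    tags = branch.split("/")
    if tags[0] != root.tag:
        raise ValueError("branch does not start at the root: " + branch)
    node = root
    for tag in tags[1:]:
        node = node.find(tag)
        if node is None:
            raise ValueError("no such branch: " + branch)
    return node

def build_report(actions):
    """Apply Initialize first, then AddText and AddTree in the order given."""
    inits = [a for a in actions if a[0] == "Initialize"]
    if len(inits) != 1:
        raise ValueError("expected exactly one Initialize atom")
    root = add_skeleton(None, inits[0][1])
    for kind, *args in actions:
        if kind == "AddText":
            text, branch = args
            find(root, branch).text = text
        elif kind == "AddTree":
            term, branch = args
            add_term(find(root, branch), term)
    return root

example_actions = [
    ("Initialize", ("report", "source", "today")),
    ("AddText", "TV1 and TV3", "report/source"),
    ("AddText", "showers and inclement", "report/today"),
    ("AddTree", ("regionalreport", ("region", "SouthEast"), ("maxtemp", "20C")), "report"),
]
merged = ET.tostring(build_report(example_actions), encoding="unicode")
print(merged)
```

Processing Initialize before everything else matters because the other atoms address branches that only exist once the skeleton has been created.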



Proposition 2. Let $(\Delta, \Gamma, \Phi)$ be a fusion call. Actions$(\Delta, \Gamma, \Phi)$ is a finite set iff $\Gamma$ is a
finite set of fusion rules.

Given a fusion call $(\Delta, \Gamma, \Phi)$, the set Actions$(\Delta, \Gamma, \Phi)$ is input to an algorithm for
constructing a merged report. The minimum we expect of a set of action atoms is that a
merged report can be produced that meets the specification. We formalize this as follows.

Definition 19. Let $(\Delta, \Gamma, \Phi)$ be a fusion call and let $\sigma$ be a structured news report.

$\sigma$ meets Actions$(\Delta, \Gamma, \Phi)$ iff $\forall \beta \in$ Actions$(\Delta, \Gamma, \Phi)$, $\sigma$ meets $\beta$

where $\sigma$ is isomorphic with a news term $\pi$ such that

$\sigma$ meets Initialize$(\phi(\psi_1, .., \psi_n))$
iff each branch of $\phi(\psi_1, .., \psi_n)$ is a branch of $\pi$

$\sigma$ meets AddText$(T, \phi_1/../\phi_n)$
iff Textentry$(\phi_1/../\phi_n, \pi) = T$

$\sigma$ meets AddTree$(T, \phi_1/../\phi_n)$
iff Subtree$(\phi_1/../\phi_n/\phi_{n+1}, \pi) = T$ and the root of $T$ is $\phi_{n+1}$
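
The meets relation can be checked mechanically once a report is available as a news term. The sketch below again encodes news terms and skeletons as nested tuples and action atoms as tagged tuples, all of which are illustrative assumptions. Because a merged report may contain several sibling subtrees with the same tagname (for instance two regionalreport elements), the lookup here collects every subterm reached by a branch and accepts a match on any of them; that reading is a choice made for this sketch, not something the definition spells out.

```python
def lookup_all(branch, term):
    """All subterms whose root is reached by following the branch from the root of term."""
    tags = branch.split("/")
    if term[0] != tags[0]:
        return []
    nodes = [term]
    for tag in tags[1:]:
        nodes = [c for n in nodes for c in n[1:]
                 if isinstance(c, tuple) and c[0] == tag]
    return nodes

def branches(skeleton, prefix=""):
    """Yield every branch of a skeleton given as a tagname or (tagname, kid, ...)."""
    if isinstance(skeleton, str):
        tag, kids = skeleton, ()
    else:
        tag, kids = skeleton[0], skeleton[1:]
    yield prefix + tag
    for kid in kids:
        yield from branches(kid, prefix + tag + "/")

def meets(term, action):
    """Does the news term meet a single action atom?"""
    kind = action[0]
    if kind == "Initialize":
        return all(lookup_all(b, term) for b in branches(action[1]))
    if kind == "AddText":
        text, branch = action[1], action[2]
        return any(n[1:] == (text,) for n in lookup_all(branch, term))
    if kind == "AddTree":
        tree, branch = action[1], action[2]
        return tree in lookup_all(branch + "/" + tree[0], term)
    raise ValueError("unknown action atom: " + kind)

def meets_all(term, actions):
    return all(meets(term, a) for a in actions)

north_west = ("regionalreport", ("region", "NorthWest"), ("maxtemp", "18C"))
report = ("report",
          ("source", "TV1 and TV3"),
          ("today", "showers and inclement"),
          ("regionalreport", ("region", "SouthEast"), ("maxtemp", "20C")),
          north_west)
spec = [("Initialize", ("report", "source", "today")),
        ("AddText", "TV1 and TV3", "report/source"),
        ("AddTree", north_west, "report")]
print(meets_all(report, spec))
```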

However, the meets relation is a little too relaxed in the sense that a report may meet 
an action sequence but may also include extra information that has not been specified. 

Definition 20. The matches relation is defined as follows where $\sigma$ is a structured
news report and $(\Delta, \Gamma, \Phi)$ is a fusion call: $\sigma$ matches Actions$(\Delta, \Gamma, \Phi)$ iff $\sigma$ meets
Actions$(\Delta, \Gamma, \Phi)$ and there is no $\sigma'$ s.t. ($\sigma'$ meets Actions$(\Delta, \Gamma, \Phi)$ and Subterms$(\sigma') \subset$
Subterms$(\sigma)$).

The matches relation identifies the minimal structured news report(s) that meet(s) 
the action sequence. In other words, it identifies the news reports that do not include any 
superfluous information. 

Definition 21. A fusion call $(\Delta, \Gamma, \Phi)$ is complete iff there is a $\psi$ for which the Initialize$(\psi)$
atom is in Actions$(\Delta, \Gamma, \Phi)$ and for all branches $\phi_1/../\phi_n$ of $\psi$ there is an action atom
AddText$(T, \phi_1/../\phi_n)$ in Actions$(\Delta, \Gamma, \Phi)$ for some $T$.

In other words, an action sequence is complete if it is not the case that the structured 
news report that results has missing textentries. 

Definition 22. A fusion call $(\Delta, \Gamma, \Phi)$ is consistent iff (1) there is exactly one
Initialize$(\psi)$ atom in Actions$(\Delta, \Gamma, \Phi)$; and (2) if AddText$(T, \phi_1/../\phi_n)$ is in
Actions$(\Delta, \Gamma, \Phi)$, then for all AddText$(T', \phi_1/../\phi_n)$ in Actions$(\Delta, \Gamma, \Phi)$, $T = T'$.
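
Completeness and consistency are properties of the set of action atoms alone, so they can be tested before any report is built. The sketch below uses the same hypothetical tuple encoding of action atoms as tagged tuples. One interpretive assumption is made and should be read as such: "all branches" in Definition 21 is taken to mean the branches that end at a leaf of the skeleton, since Example 9 has no AddText atom for the branch report itself.

```python
def leaf_branches(skeleton, prefix=""):
    """Yield the branches of a skeleton that end at a tagname without children."""
    if isinstance(skeleton, str):
        yield prefix + skeleton
    elif len(skeleton) == 1:
        yield prefix + skeleton[0]
    else:
        for kid in skeleton[1:]:
            yield from leaf_branches(kid, prefix + skeleton[0] + "/")

def is_complete(actions):
    """Some Initialize atom has an AddText atom for each of its leaf branches."""
    covered = {a[2] for a in actions if a[0] == "AddText"}
    return any(a[0] == "Initialize"
               and all(b in covered for b in leaf_branches(a[1]))
               for a in actions)

def is_consistent(actions):
    """Exactly one Initialize atom, and at most one textentry per branch."""
    inits = [a for a in actions if a[0] == "Initialize"]
    texts = {}
    for a in actions:
        if a[0] == "AddText":
            texts.setdefault(a[2], set()).add(a[1])
    return len(inits) == 1 and all(len(t) == 1 for t in texts.values())

good = [("Initialize", ("report", "source", "today")),
        ("AddText", "TV1 and TV3", "report/source"),
        ("AddText", "showers and inclement", "report/today")]
clash = good + [("AddText", "rain", "report/today")]
print(is_complete(good), is_consistent(good), is_consistent(clash))
```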



Proposition 3. A fusion call $(\Delta, \Gamma, \Phi)$ is complete and consistent iff there is a structured
news report $\pi$ such that $\pi$ meets Actions$(\Delta, \Gamma, \Phi)$.



Definition 23. A fusion system $(\Delta, \Gamma)$ is well-behaved iff for all $\Phi$ either $(\Delta, \Gamma, \Phi)$ is a
complete and consistent fusion call or there is no $\psi$ such that the Initialize$(\psi)$ atom
is in Actions$(\Delta, \Gamma, \Phi)$.

This means the fusion rules in a fusion system need to be engineered so that exactly
one Initialize atom is obtained for any set of structured news reports that are to be
covered by the fusion system, and no Initialize atom is to be obtained for any set of
structured news reports $\Phi$ that are not to be covered by the fusion system.

The set of action atoms that we have defined in this paper is only part of the range of
possible action atoms. We have implemented others including ExtendTree$(T, \phi_1/../\phi_n)$
where $T$ is a news term, and $\phi_1/../\phi_n$ is a branch constant, and the intended action is
to extend the merged report with $T$ so that the tagname for the root of $T$ is $\phi_n$ on the
branch $\phi_1/../\phi_n$. We intend to extend the range of implemented functions to include
functions based on voting strategies so that for example if the majority of news reports
input have a particular textentry on a particular branch, then the merged report will have
that textentry on that branch.

5 Discussion 

The definition for a fusion call suggests an implementation based on existing automated 
reasoning technology and on XML programming technology. Once information is in 
the form of XML documents, a number of technologies for managing and manipulating 
information in XML are available. We have developed a prototype implementation in 
Java for executing fusion rules that are marked up in FusionRuleML and
constructing the merged reports [17]. Background knowledge is handled in a Prolog
system and this is queried by the Java implementation.

Our logic-based approach differs from other logic-based approaches for handling
inconsistent information such as belief revision theory (e.g. [11,9,18,20]) and knowledgebase
merging (e.g. [19,1]). These proposals are too simplistic in certain respects
for handling news reports. Each of them has one or more of the following weaknesses:
(1) One-dimensional preference ordering over sources of information (for news reports
we require finer-grained preference orderings); (2) Primacy of updates in belief
revision (for news reports, the newest reports are not necessarily the best reports);
and (3) Weak merging based on a meet operator (this causes unnecessary loss of information).
Furthermore, none of these proposals incorporate actions on inconsistency
or context-dependent rules specifying the information that is to be incorporated in the
merged information, nor do they offer a route for specifying how merged reports should
be composed.

Merging information is also an important topic in database systems. A number of 
proposals have been made for approaches based in schema integration (e.g. [24]), the 
use of global schema (e.g. [12]), and conceptual modelling for information integration 
based on description logics [4,3,10,23,2]. These differ from our approach in that they 
do not seek an automated approach that uses domain knowledge for identifying and 
acting on inconsistencies. Heterogeneous and federated database systems are relevant, 
but they do not identify and act on inconsistency in a context-sensitive way [26,22,6], 
though there is increasing interest in bringing domain knowledge into the process (e.g. 
[5,27]). Also relevant is revision programming, a logic-based framework for describing
and enforcing database constraints [21].

Our approach also goes beyond other technologies for handling news reports. The
approach of wrappers offers a practical way of defining how heterogeneous information
can be merged (see for example [13,7,25]). However, there is little consideration of
problems of conflicts arising between sources. Our approach therefore goes beyond
these in terms of formalizing reasoning with inconsistent information and using this to
analyse the nature of the news report and for formalizing how we can act on inconsistency.



Acknowledgements. The authors wish to thank Weiru Liu and the referees for helpful 

References

1. C. Baral, S. Kraus, J. Minker, and V. Subrahmanian. Combining knowledgebases consisting 
of first-order theories. Computational Intelligence, 8:45-71, 1992. 

2. S. Bergamaschi, S. Castano, M. Vincini, and D. Beneventano. Semantic integration of het- 
erogeneous information sources. Data and Knowledge Engineering, 36:215-249, 2001. 

3. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Description logic 
framework for information integration. In Proceedings of the 6th Conference on the Principles 
of Knowledge Representation and Reasoning (KR’98), pages 2-13. Morgan Kaufmann, 1998.







4. D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Source integration in 
data warehousing. In Proceedings of the 9th International Workshop on Database and Expert 
Systems (DEXA’98), pages 192-197. IEEE Computer Society Press, 1998. 

5. L. Cholvy. Reasoning with data provided by federated databases. Journal of Intelligent 
Information Systems, 10:49-80, 1998. 

6. L. Cholvy and S. Moral. Merging databases: Problems and examples. International Journal 
of Intelligent Systems, 10:1193-1221, 2001. 

7. W. Cohen. A web-based information system that reasons with structured collections of text. 
In Proceedings of Autonomous Agents ’98, 1998. 

8. J. Cowie and W. Lehnert. Information extraction. Communications of the ACM, 39:80-91,
1996.

9. D. Dubois and H. Prade, editors. Handbook of Defeasible Reasoning and Uncertainty
Management Systems, volume 3. Kluwer, 1998.

10. E. Franconi and U. Sattler. A data warehouse conceptual data model for multidimensional 
aggregation. In S. Gatziu, M. Jeusfeld, M. Staudt, and Y. Vassiliou, editors. Proceedings of 
the Workshop in Design and Management of Data Warehouses, 1999. 

11. P. Gärdenfors. Knowledge in Flux. MIT Press, 1988.

12. G. Grahne and A. Mendelzon. Tableau techniques for querying information sources through 
global schemas. In Proceedings of the 7th International Conference on Database Theory 
(ICDT’99), Lecture Notes in Computer Science. Springer, 1999. 

13. J. Hammer, H. Garcia-Molina, S. Nestorov, and R. Yerneni. Template-based wrappers in the
TSIMMIS system. In Proceedings of ACM SIGMOD’97. ACM, 1997. 

14. A. Hunter. Merging potentially inconsistent items of structured text. Data and Knowledge 
Engineering, 34:305-332, 2000. 

15. A. Hunter. Logical fusion rules for merging structured news reports. Data and Knowledge
Engineering, 42:23-56, 2002.

16. A. Hunter. Merging structured text using temporal knowledge. Data and Knowledge
Engineering, 41:29-66, 2002.

17. A. Hunter and R. Summerton. FusionRuleML: Representing and executing fusion rules.
Technical report, UCL Department of Computer Science, 2002.

18. H. Katsuno and A. Mendelzon. On the difference between updating a knowledgebase and 
revising it. In Principles of Knowledge Representation and Reasoning: Proceedings of the 
Second International Conference (KR’91), pages 387-394. Morgan Kaufmann, 1991. 

19. S. Konieczny and R. Pino Perez. On the logic of merging. In Proceedings of the Sixth 
International Conference on Principles of Knowledge Representation and Reasoning (KR’98),
pages 488-498. Morgan Kaufmann, 1998. 

20. P. Liberatore and M. Schaerf. Arbitration (or how to merge knowledgebases). IEEE Trans- 
actions on Knowledge and Data Engineering, 10:76-90, 1998. 

21. V. Marek and M. Truszczynski. Revision programming. Theoretical Computer Science, 
190:241-277, 1998. 

22. A. Motro. Cooperative database systems. International Journal of Intelligent Systems, 
11:717-732, 1996. 

23. N. Paton, R. Stevens, P. Baker, C. Goble, S. Bechhofer, and A. Brass. Query processing in the 
TAMBIS bioinformatics source integration system. In Proceedings of the 11th International
Conference on Scientific and Statistical Databases, 1999. 

24. A. Poulovassilis and P. McBrien. A general formal framework for schema transformation. 
Data and Knowledge Engineering, 28:47-71, 1998. 

25. A. Sahuguet and F. Azavant. Building light-weight wrappers for legacy web data-sources using
W4F. In Proceedings of the International Conference on Very Large Databases (VLDB’99), 
1999. 







26. A. Sheth and J. Larson. Federated database systems for managing distributed, heterogeneous, 
and autonomous databases. ACM Computing Surveys, 22:183-236, 1990. 

27. K. Smith and L. Obrst. Unpacking the semantics of source and usage to perform semantic 
reconciliation in large-scale information systems. In ACM SIGMOD RECORD, volume 28, 
pages 26-31, 1999. 




Preferential Logics for Reasoning with Graded Uncertainty



Ofer Arieli 

Department of Computer Science, The Academic College of Tel-Aviv 
4 Antokolski street, Tel-Aviv 61161, Israel 
oarieli@mta.ac.il



Abstract. We introduce a family of preferential logics that are useful for 
handling information with different levels of uncertainty. The correspond- 
ing consequence relations are non-monotonic, paraconsistent, adaptive, 
and rational. It is also shown that any formalism in this family that is 
based on a well-founded ordering of the different types of uncertainty, 
can be embedded in a corresponding four-valued logic with at most three 
uncertainty levels. 



1 Motivation 

The ability to reason in a ‘rational’ way with incomplete or inconsistent information
is a major challenge, and its significance should be obvious. It is well-known
that classical logic is not suitable for this task, thus non-classical formalisms are
usually used for handling uncertainty.¹ Such formalisms should be able, moreover,
to distinguish among different types of uncertainty in the underlying data,
since each kind of uncertainty may require a different treatment. The following
example demonstrates such a case:

Example 1. Let $\mathcal{P} = \{p \leftarrow \mathsf{true},\ \neg p \leftarrow \mathsf{true},\ q \leftarrow \mathsf{not}\ \neg r,\ \neg q \leftarrow \mathsf{not}\ r\}$. This
is a ‘prolog-like’ program, with two kinds of negation operators: one, $\neg$, intuitively
represents explicit negation, and the other, not, represents implicit negative
information, and may be intuitively understood as a ‘negation-as-failure’ (to
prove or verify the corresponding assertion on the basis of the available information).
The meaning of the last two clauses of $\mathcal{P}$ is, therefore, that $q$ (respectively,
$\neg q$) holds provided that $\neg r$ (respectively, $r$) cannot be verified.

The theory above depicts several types of uncertainty: the information about
$r$ is incomplete, since $r$ does not appear in a head of any clause in $\mathcal{P}$, and so
no explicit data about it (nor about its negation) is available. This implies, in
particular, that one cannot conclude that either $r$ or $\neg r$ holds, and so, by the last
two clauses, the data about $q$ is inconsistent. Clearly, by the first two clauses,
the information about $p$ is inconsistent as well. Note, however, that there is
a difference between the inconsistent information about $p$ and about $q$: while
the contradiction regarding $p$ is based on explicit data, the evidence about $q$ is
less ‘stable’, since it relies on the (possibly temporary) fact that neither $r$ nor
$\neg r$ holds. In particular, once $r$ is validated or falsified, the information about
$q$ would not remain contradictory anymore! One may also argue that although
the information about $r$ is incomplete, there is still more knowledge about $r$
(e.g., that it determines the validity of $q$) than about, say, $s$ (about which we
don’t know anything whatsoever). Here, again, we have two different degrees of
uncertainty.

¹ See, e.g., [12,15,18,23] for some recent collections of papers on this topic.

The example above demonstrates one case in which it is natural to attach 
different levels of uncertainty to different assertions. This kind of information 
may be used, for instance, by algorithms for consistency restoration, since data 
with higher degree of inconsistency may be treated (i.e., eliminated) first. 

In this paper we consider a framework that supports this type of considera- 
tions, and provides means to reason with different levels of uncertain information. 
We show that the logics that are obtained are nonmonotonic, paraconsistent [19], 
adaptive in the sense of Batens [10,11], and rational in the sense of Lehmann 
and Magidor [28]. It is also shown that under a certain assumption on the grading
relations, for each one of these formalisms there is a logically equivalent
four-valued logic with at most three different levels of uncertainty.



2 The Framework 

2.1 Logical Lattices and Their Consequence Relations 

In order to overcome the shortcomings of classical logic in properly handling 
uncertainty, we turn to multiple-valued logics. This is a common approach that 
is the basis of many formal systems (see [9] for a recent survey), including systems 
that are based on fuzzy logic [22], probabilistic reasoning [33], possibilistic logics
[21], annotated logics [26,37], and fixpoint semantics for extended/disjunctive 
logic programs (see, e.g., [3,29], and a survey in [20]). 

In most of the approaches mentioned above, as in the present one, the truth-values
are arranged in a lattice structure. In what follows we denote by $\mathcal{L} = (L, \leq)$
a bounded lattice that has at least four elements: a $\leq$-maximal element and a $\leq$-minimal
element that correspond to the classical values (denoted, respectively,
by $t$ and $f$), and two intermediate elements (denoted by $\top$ and $\bot$) that may
intuitively be understood as representing the two basic types of uncertainty:
inconsistency and incompleteness (respectively). As usual, the meet and the join
operations on $\mathcal{L}$ are denoted by $\wedge$ and $\vee$. In addition, we assume that $\mathcal{L}$ has an
involution operator $\neg$ (a “negation”) s.t. $\neg t = f$, $\neg f = t$, $\neg\top = \top$, $\neg\bot = \bot$. We
denote by $\mathcal{D}$ the set of the designated values of $L$ (i.e., the set of the truth values
in $L$ that represent true assertions). We shall assume that $\mathcal{D}$ is a prime filter in
$\mathcal{L}$ s.t. $\top \in \mathcal{D}$ and $\bot \notin \mathcal{D}$. The pair $(\mathcal{L}, \mathcal{D})$ is called a logical lattice [6].
In particular, $t \in \mathcal{D}$ and $f \notin \mathcal{D}$.

Fig. 1. $\mathcal{FOUR}$ and $\mathcal{NINE}$



Example 2. The smallest logical lattice is shown in Fig. 1 (left). We denote it by
$\mathcal{FOUR}$. This lattice, together with the set $\mathcal{D} = \{t, \top\}$ of designated values, is
the algebraic structure behind Belnap’s well-known four-valued logic [13,14], and
it will play an important role here as well (see Sect. 3). $\mathcal{NINE}$ (Fig. 1, right)
may be viewed as an extended version of $\mathcal{FOUR}$ for default reasoning (dt =
true by default, bt = ‘biased’ for t, etc.). This lattice depicts three main levels
of uncertainty: incomplete data ($\bot$), inconsistent data ($\top$), and a middle level
of uncertainty (to). The latter kind of uncertainty sometimes follows from contradictory
default assumptions, so it may be retracted when further information
arrives. The decision whether to view to as designated is one of the differences
between the two logical lattices that $\mathcal{NINE}$ induces, namely $(\mathcal{NINE}, \{t, bt, \top\})$
and $(\mathcal{NINE}, \{t, dt, bt, bf, to, \top\})$.

The set $\mathcal{U} = \{(x,y) \mid 0 \leq x \leq 1,\ 0 \leq y \leq 1\}$ with
$(x_1,y_1) \vee (x_2,y_2) = (\max(x_1,x_2), \min(y_1,y_2))$ and
$(x_1,y_1) \wedge (x_2,y_2) = (\min(x_1,x_2), \max(y_1,y_2))$ is an infinite
lattice, and $(\mathcal{U}, \{(1,x) \mid 0 \leq x \leq 1\})$ is a logical
lattice with $t = (1,0)$, $f = (0,1)$, $\top = (1,1)$, and $\bot = (0,0)$. One
way to intuitively understand the meaning of an element
$(x,y) \in \mathcal{U}$ is that $x$ represents the amount of belief for the
underlying assertion, and $y$ represents the amount of belief against it.
Following this intuition, every element $(x,x) \in \mathcal{U}$ may be
associated with a different degree of inconsistency.

Given a logical lattice $(\mathcal{L}, \mathcal{D})$, the standard semantical
notions are natural generalizations of the classical ones: a (multiple-valued)
valuation $\nu$ is a function that assigns an element of $L$ to each atomic
formula. The set of valuations onto $L$ is denoted by
$\mathcal{V}^{\mathcal{L}}$. Extension to complex formulae is done in the usual
way. A valuation is a model of a set of assertions $\Gamma$ if it assigns a
designated value to every formula in $\Gamma$. The set of all the models of
$\Gamma$ is denoted by $mod(\Gamma)$.

The language considered here is a propositional one. Note that there are no
tautologies in the language of $\{\neg, \vee, \wedge\}$, since if all the
atomic formulae that appear in a formula $\psi$ are assigned $\bot$ by a
valuation $\nu$, then $\nu(\psi) = \bot$ as well. It follows that the
definition of the material implication $p \leadsto q$ as $\neg p \vee q$ is
not adequate for representing entailments in our semantics. Instead, we use
another connective, which does function as an implication in our setting:

Definition 1. [4,8] Let $(\mathcal{L}, \mathcal{D})$ be a logical lattice.
Define: $x \to y = y$ if $x \in \mathcal{D}$, and $x \to y = t$ otherwise.³
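As an illustration of Definition 1, the following Python sketch instantiates the connectives on $\mathcal{FOUR}$. The encoding of truth values as pairs (evidence for, evidence against) and all function names are assumptions of this sketch, not notation from the text.

```python
from itertools import product

# Assumed encoding: (evidence for, evidence against).
T, F, TOP, BOT = (1, 0), (0, 1), (1, 1), (0, 0)
FOUR = [T, F, TOP, BOT]
DESIGNATED = {T, TOP}

def neg(x):
    return (x[1], x[0])

def join(x, y):
    return (max(x[0], y[0]), min(x[1], y[1]))

def meet(x, y):
    return (min(x[0], y[0]), max(x[1], y[1]))

def implies(x, y):
    # Definition 1: x -> y is y if x is designated, and t otherwise.
    return y if x in DESIGNATED else T

def material(x, y):
    # Material implication, defined as (not x) or y.
    return join(neg(x), y)

# On the classical values the two implications coincide.
classical_agree = all(implies(x, y) == material(x, y)
                      for x, y in product([T, F], repeat=2))

# With every atom at bot, a formula over {neg, join, meet} stays at bot,
# so p or (not p) is not designated there.
excluded_middle_at_bot = join(BOT, neg(BOT))

# In contrast, p -> p receives a designated value everywhere.
self_implication_valid = all(implies(x, x) in DESIGNATED for x in FOUR)
```

Under this encoding the designated values are exactly the pairs whose first component is 1, which mirrors the prime filter $\{t, \top\}$.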

The language of $\{\neg, \vee, \wedge, \to\}$ together with the propositional
constants $t$, $f$, $\top$ and $\bot$ will be denoted by $\Sigma$. Given a set
of formulae $\Gamma$ in $\Sigma$ we shall denote by $\mathcal{A}(\Gamma)$ the
set of the atomic formulae that appear in some formula of $\Gamma$.

Now, a natural definition of a lattice-based consequence relation is the fol- 
lowing: 

Definition 2. Let $(\mathcal{L}, \mathcal{D})$ be a logical lattice, $\Gamma$
a set of formulae, and $\psi$ a formula. Denote
$\Gamma \models^{\mathcal{L},\mathcal{D}} \psi$ if every model of $\Gamma$ is
a model of $\psi$.

The relation $\models^{\mathcal{L},\mathcal{D}}$ of Definition 2 is a
consequence relation in the standard sense of Tarski [38]. In [4] it is shown
that this relation is monotonic, compact, paraconsistent [19], and has a
corresponding sound and complete cut-free Gentzen-type system. The major
drawbacks of $\models^{\mathcal{L},\mathcal{D}}$ are that it is strictly
weaker than classical logic even for consistent theories (e.g.,
$\psi \not\models^{\mathcal{L},\mathcal{D}} \neg\phi \vee \phi$), and that it
always invalidates some intuitively justified inference rules, like the
Disjunctive Syllogism (that is,
$\psi, \neg\psi \vee \phi \not\models^{\mathcal{L},\mathcal{D}} \phi$). In the
next section we consider a family of logics that overcome these drawbacks.
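The two drawbacks can be checked by brute force on $\mathcal{FOUR}$. The sketch below is only illustrative: the tuple-based formula syntax, the pair encoding of truth values and the function names are our assumptions.

```python
from itertools import product

# Assumed encoding: (evidence for, evidence against).
T, F, TOP, BOT = (1, 0), (0, 1), (1, 1), (0, 0)
DESIGNATED = {T, TOP}

def evaluate(phi, v):
    """Formulas: an atom name, ('not', a), ('or', a, b) or ('and', a, b)."""
    if isinstance(phi, str):
        return v[phi]
    if phi[0] == 'not':
        x = evaluate(phi[1], v)
        return (x[1], x[0])
    x, y = evaluate(phi[1], v), evaluate(phi[2], v)
    if phi[0] == 'or':
        return (max(x[0], y[0]), min(x[1], y[1]))
    return (min(x[0], y[0]), max(x[1], y[1]))

def entails(premises, conclusion, atoms):
    """Definition 2 on FOUR: every model of the premises models the conclusion."""
    for values in product([T, F, TOP, BOT], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        is_model = all(evaluate(g, v) in DESIGNATED for g in premises)
        if is_model and evaluate(conclusion, v) not in DESIGNATED:
            return False
    return True

atoms = ['psi', 'phi']
# psi does not entail (not phi) or phi: take phi = bot.
excluded_middle = entails(['psi'], ('or', ('not', 'phi'), 'phi'), atoms)
# Disjunctive Syllogism fails: take psi = top and phi = f.
disjunctive_syllogism = entails(
    ['psi', ('or', ('not', 'psi'), 'phi')], 'phi', atoms)
# Reflexive inferences still go through.
reflexive = entails(['psi', 'phi'], 'psi', atoms)
```

Here `excluded_middle` and `disjunctive_syllogism` both evaluate to `False`, in line with the two counterexamples named in the text.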



2.2 Preferential Reasoning and the Consequence Relation $\models_c^{\mathcal{L},\mathcal{D}}$

In order to recapture within our many-valued framework classical reasoning
(where its use is appropriate), as well as standard non-monotonic and
paraconsistent methods, we incorporate a concept first introduced by McCarthy [32]
and later considered by Shoham [36], according to which inferences from a given
theory are made w.r.t. a subset of the models of that theory (and not w.r.t. every 
model of the theory; see also [24,27,30,31,35]). This set of preferential models is 
determined according to some conditions that can be specified by a set of (usu- 
ally second-order) propositions [7], or by some order relation on the models of 
the theory [4,5]. This relation should reflect some kind of preference criterion on 
the models of the set of premises. In our case the idea is to give precedence to 
those valuations that minimize the amount of uncertain information in the set 
of premises. The truth values are therefore arranged according to an order rela- 
tion that reflects differences in the amount of uncertainty that each one of them 
exhibits. Then we choose those valuations that minimize the amount of uncer- 
tainty w.r.t. this order. The intuition behind this approach is that incomplete or 
contradictory data corresponds to inadequate information about the real world, 
and therefore it should be minimized. Next we formalize this idea. 

³ Note that on $\{t, f\}$ the material implication ($\leadsto$) and the new
implication ($\to$) are identical, and both of them are generalizations of the
classical implication.

Definition 3. A partial order $<$ on a set $L$ is called modular if $y < x_2$
for every $x_1, x_2, y \in L$ s.t. $x_1 \not< x_2$, $x_2 \not< x_1$, and
$y < x_1$.



Proposition 1. [28] Let < be a partial order on L. The following conditions 
are equivalent: 

a) < is modular. 

b) For every $x_1, x_2, y \in L$, if $x_1 < x_2$ then either $y < x_2$ or $x_1 < y$.

c) There is a totally ordered set $L'$ with a strict order $\prec$ and a
function $g : L \to L'$ s.t. $x_1 < x_2$ iff $g(x_1) \prec g(x_2)$.
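Condition (c) suggests a convenient way of handling modular orders in practice: represent them by a rank function into the integers. The following sketch checks Definition 3 directly on small finite orders; the element names, the rank values and the non-modular order used for contrast are all hypothetical choices made for illustration.

```python
from itertools import product

def is_modular(elements, less):
    """Definition 3: y < x2 whenever x1, x2 are incomparable and y < x1."""
    def incomparable(a, b):
        return not less(a, b) and not less(b, a)
    return all(less(y, x2)
               for x1, x2, y in product(elements, repeat=3)
               if incomparable(x1, x2) and less(y, x1))

def order_from_rank(rank):
    """Condition (c): a strict order induced by a map g into a total order."""
    return lambda a, b: rank[a] < rank[b]

elements = ['t', 'f', 'bot', 'top']

# An order given by ranks (here: {t, f} below bot below top).
ranked = order_from_rank({'t': 0, 'f': 0, 'bot': 1, 'top': 2})

# A hypothetical non-modular order: f < bot < top, with t comparable to nothing.
chain = {('f', 'bot'), ('bot', 'top'), ('f', 'top')}
isolated_t = lambda a, b: (a, b) in chain

ranked_is_modular = is_modular(elements, ranked)
isolated_is_modular = is_modular(elements, isolated_t)
```

The second order fails the test because `bot` and `t` are incomparable and `f` lies below `bot`, yet `f` is not below `t`.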



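Definition 4. An inconsistency order $<_c^{\mathcal{L},\mathcal{D}}$ on a
logical lattice $(\mathcal{L}, \mathcal{D})$ is a well-founded modular order
on $L$, with the following properties:

a) $t$ and $f$ are minimal and $\top$ is maximal w.r.t. $<_c^{\mathcal{L},\mathcal{D}}$,

b) if $\{x, \neg x\} \subseteq \mathcal{D}$ while $\{y, \neg y\} \not\subseteq \mathcal{D}$, then $x \not<_c^{\mathcal{L},\mathcal{D}} y$,

c) $x$ and $\neg x$ are either equal or $<_c^{\mathcal{L},\mathcal{D}}$-incomparable.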

Inconsistency orders are used here for grading uncertainty in general, and 
inconsistency in particular. Intuitively, the meaning of
$x <_c^{\mathcal{L},\mathcal{D}} y$ is that formulae that are assigned $x$ are
more definite than formulae with a truth value $y$. Modularity is needed for
assuring a proper grading of the truth values.⁴ Condition (b)
in Definition 4 assures that truth values that intuitively represent inconsistent 
data will not be considered as more consistent than those ones that correspond 
to consistent data. The last condition makes sure that any truth value and its 
negation have the same degree of (in)consistency. 



Example 3. $\mathcal{FOUR}$ has four inconsistency orders:

a) The degenerated order, $<_{c_0}^4$, in which $t, f, \bot, \top$ are all incomparable.

b) $<_{c_1}^4$, in which $\bot$ is considered as minimally inconsistent: $\{t, f, \bot\} <_{c_1}^4 \top$.

c) $<_{c_2}^4$, in which $\bot$ is maximally inconsistent: $\{t, f\} <_{c_2}^4 \{\top, \bot\}$.

d) $<_{c_3}^4$, in which $\bot$ is an intermediate level of inconsistency: $\{t, f\} <_{c_3}^4 \bot <_{c_3}^4 \top$.



In the rest of the paper we shall continue to use the notations of Example 3
for denoting the inconsistency orders in $\mathcal{FOUR}$.

Given an inconsistency order $<_c^{\mathcal{L},\mathcal{D}}$ on a logical
lattice $(\mathcal{L}, \mathcal{D})$, it induces an equivalence relation on
$L$, in which two elements in $L$ are equivalent iff they are equal or
$<_c^{\mathcal{L},\mathcal{D}}$-incomparable. For every $x \in L$, we denote by
$[x]$ the equivalence class of $x$ with respect to this equivalence relation.
I.e.,

$[x] = \{y \mid y = x, \text{ or } x \text{ and } y \text{ are } <_c^{\mathcal{L},\mathcal{D}}\text{-incomparable}\}$.

The order relation on these classes is defined as usual by representatives: we
denote $[x] \leq_c^{\mathcal{L},\mathcal{D}} [y]$ iff either
$x <_c^{\mathcal{L},\mathcal{D}} y$, or $x$ and $y$ are
$<_c^{\mathcal{L},\mathcal{D}}$-incomparable.⁵ It is easy to verify that this
definition is proper, i.e. it does not depend on the choice of the
representatives.

⁴ That is, to eliminate orders such as $\{\{t\}, \{f < \bot < \top\}\}$, in
which $\top$ and $\bot$ are not comparable with $t$, while they are comparable
with $\neg t$.

⁵ As usual, we use the same notation to denote the order relation among
equivalence classes and the order relation among their elements.

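An inconsistency order $\leq_c^{\mathcal{L},\mathcal{D}}$ on
$(\mathcal{L}, \mathcal{D})$ induces the following pre-order on
$\mathcal{V}^{\mathcal{L}}$:

Definition 5. Let $\leq_c^{\mathcal{L},\mathcal{D}}$ be an inconsistency order
on $(\mathcal{L}, \mathcal{D})$, and let $\nu_1, \nu_2 \in \mathcal{V}^{\mathcal{L}}$.

a) $\nu_1 \leq_c^{\mathcal{L},\mathcal{D}} \nu_2$ iff for every atom $p$, $[\nu_1(p)] \leq_c^{\mathcal{L},\mathcal{D}} [\nu_2(p)]$.

b) $\nu_1 <_c^{\mathcal{L},\mathcal{D}} \nu_2$ if $\nu_1 \leq_c^{\mathcal{L},\mathcal{D}} \nu_2$ and there is an atom $q$ s.t. $[\nu_1(q)] <_c^{\mathcal{L},\mathcal{D}} [\nu_2(q)]$.

Definition 6. Let $\leq_c^{\mathcal{L},\mathcal{D}}$ be an inconsistency order
in a logical lattice $(\mathcal{L}, \mathcal{D})$. The set of the c-most
consistent models of a set $\Gamma$ of formulae in $\Sigma$ (abbreviation: the
c-mcms of $\Gamma$) are the minimal inconsistent models of $\Gamma$, i.e.:

$!(\Gamma, \leq_c^{\mathcal{L},\mathcal{D}}) = \{\nu \in mod(\Gamma) \mid \neg\exists \mu \in mod(\Gamma) \text{ s.t. } \mu <_c^{\mathcal{L},\mathcal{D}} \nu\}$.

Now we can refine the inference process, defined by the lattice-based
consequence relation $\models^{\mathcal{L},\mathcal{D}}$ (Definition 2).
Instead of considering every possible model of the premises, we take into
account only the c-most consistent ones.

Definition 7. Let $\leq_c^{\mathcal{L},\mathcal{D}}$ be an inconsistency order
on a logical lattice $(\mathcal{L}, \mathcal{D})$. Denote:
$\Gamma \models_c^{\mathcal{L},\mathcal{D}} \psi$ if every c-mcm of $\Gamma$
is a model of $\psi$.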



Example 4. Consider one direction of the barber paradox:⁶

$\Gamma = \{\neg shaves(x, x) \to shaves(Barber, x)\}$.

Denote by $\nu_1$, $\nu_2$, and $\nu_3$ the valuations that assign $t$,
$\bot$, and $\top$ (respectively) to the assertion $shaves(Barber, Barber)$.
Using $\mathcal{FOUR}$ as the underlying logical lattice, we have that
$!(\Gamma, \leq_{c_2}^4) = !(\Gamma, \leq_{c_3}^4) = \{\nu_1\}$,
$!(\Gamma, \leq_{c_1}^4) = \{\nu_1, \nu_2\}$, and
$!(\Gamma, \leq_{c_0}^4) = \{\nu_1, \nu_2, \nu_3\}$. Thus,
$\Gamma \not\models_{c_i}^4 shaves(Barber, Barber)$ when $i = 0, 1$, while
$\Gamma \models_{c_i}^4 shaves(Barber, Barber)$ when $i = 2, 3$.
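Example 4 can be replayed mechanically. The sketch below assumes that the Herbrand universe is just {Barber}, so that the theory reduces to the single ground formula $\neg p \to p$ with $p = shaves(Barber, Barber)$, and it encodes each inconsistency order of Example 3 by a rank table; both the encoding and the names are ours.

```python
T, F, TOP, BOT = 't', 'f', 'top', 'bot'
DESIGNATED = {T, TOP}
NEG = {T: F, F: T, TOP: TOP, BOT: BOT}

# Rank tables for the four inconsistency orders of Example 3 (assumed encoding:
# x < y iff rank[x] < rank[y]; equal ranks mean incomparable).
RANKS = {
    0: {T: 0, F: 0, BOT: 0, TOP: 0},   # degenerated order
    1: {T: 0, F: 0, BOT: 0, TOP: 1},
    2: {T: 0, F: 0, BOT: 1, TOP: 1},
    3: {T: 0, F: 0, BOT: 1, TOP: 2},
}

def implies(x, y):
    # Definition 1: x -> y is y if x is designated, and t otherwise.
    return y if x in DESIGNATED else T

# With one atom p, a valuation is just the value of p.
# Models of the ground instance (not p) -> p:
models = [x for x in (T, F, TOP, BOT) if implies(NEG[x], x) in DESIGNATED]

def most_consistent(candidates, rank):
    """c-mcms: models with no strictly more consistent model (one-atom case)."""
    return [m for m in candidates
            if not any(rank[n] < rank[m] for n in candidates)]

def entails_p(i):
    """Does every c_i-mcm assign a designated value to p?"""
    return all(m in DESIGNATED for m in most_consistent(models, RANKS[i]))

results = [entails_p(i) for i in range(4)]
```

The list `results` comes out as `[False, False, True, True]`, matching the four cases stated in the example.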

3 Embedding in Four-Valued Logics

Four-valued reasoning may be traced back to the 1950s, when it was
investigated by a number of people, including Bialynicki-Birula [16], Rasiowa [17],
and Kalman [25]. Later, Belnap [13,14] introduced a corresponding four-valued
algebraic structure (denoted here by $\mathcal{FOUR}$) for paraconsistent
reasoning. Theorem 1 below, which is our main result here, shows that this
structure is canonical for reasoning with graded uncertainty. Following [5],
this is further evidence for the robustness of four-valued logics as
representing commonsense reasoning.

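⁶ Here we assume that formulae with variables are universally quantified.
Consequently, a set of assertions $\Gamma$, containing a non-grounded formula
$\psi$, is viewed as representing the corresponding set of ground formulae,
formed by substituting for each variable that appears in $\psi$ every element
in the relevant Herbrand universe.

Definition 8.⁷ $\mathcal{V}^{\mathcal{L}}$ is stoppered w.r.t.
$\leq_c^{\mathcal{L},\mathcal{D}}$ if for every $\Gamma$ and every
$\nu \in mod(\Gamma)$, either
$\nu \in\ !(\Gamma, \leq_c^{\mathcal{L},\mathcal{D}})$ or there is a
$\nu' \in\ !(\Gamma, \leq_c^{\mathcal{L},\mathcal{D}})$ s.t.
$\nu' <_c^{\mathcal{L},\mathcal{D}} \nu$.

Note that in case that $\mathcal{V}^{\mathcal{L}}$ is well-founded w.r.t.
$\leq_c^{\mathcal{L},\mathcal{D}}$ (i.e., $\mathcal{V}^{\mathcal{L}}$ does not
have an infinitely descending chain w.r.t. $\leq_c^{\mathcal{L},\mathcal{D}}$)
then it is in particular stoppered.

Theorem 1. Let $\leq_c^{\mathcal{L},\mathcal{D}}$ be an inconsistency order on
a logical lattice $(\mathcal{L}, \mathcal{D})$ such that
$\mathcal{V}^{\mathcal{L}}$ is stoppered (with respect to the induced order on
valuations). Then $\Gamma \models_c^{\mathcal{L},\mathcal{D}} \psi$ iff
$\Gamma \models_{c_i}^4 \psi$, for some $0 \leq i \leq 3$.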

In the rest of this section we prove Theorem 1. First, we consider some 
notations and definitions. 

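Definition 9. Given a logical lattice $(\mathcal{L}, \mathcal{D})$, its
elements may be divided into the following four sets:

$\mathcal{T}_t^{\mathcal{L},\mathcal{D}} = \{x \in L \mid x \in \mathcal{D}, \neg x \notin \mathcal{D}\}$, $\mathcal{T}_f^{\mathcal{L},\mathcal{D}} = \{x \in L \mid x \notin \mathcal{D}, \neg x \in \mathcal{D}\}$,

$\mathcal{T}_\top^{\mathcal{L},\mathcal{D}} = \{x \in L \mid x \in \mathcal{D}, \neg x \in \mathcal{D}\}$, $\mathcal{T}_\bot^{\mathcal{L},\mathcal{D}} = \{x \in L \mid x \notin \mathcal{D}, \neg x \notin \mathcal{D}\}$.

Henceforth we shall usually omit the superscripts, and write $\mathcal{T}_t$, $\mathcal{T}_f$, $\mathcal{T}_\top$, $\mathcal{T}_\bot$.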



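Definition 10. Let $(\mathcal{L}, \mathcal{D})$ be a logical lattice. Denote:

$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x = \{y \in \mathcal{T}_x \mid \neg\exists y' \in \mathcal{T}_x \text{ s.t. } y' <_c^{\mathcal{L},\mathcal{D}} y\}$ $(x \in \{t, f, \top, \bot\})$

$\Omega_{\leq_c^{\mathcal{L},\mathcal{D}}} = \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_t \cup \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_f \cup \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\top \cup \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\bot$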

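Definition 11. Let $(\mathcal{L}_1, \mathcal{D}_1)$ and
$(\mathcal{L}_2, \mathcal{D}_2)$ be two logical lattices. Suppose that $x_i$
is some element in $L_i$ and $\nu_i$ is a valuation onto $L_i$ ($i = 1, 2$).

a) $x_1$ and $x_2$ are similar if $x_1 \in \mathcal{T}_y^{\mathcal{L}_1,\mathcal{D}_1}$ implies that $x_2 \in \mathcal{T}_y^{\mathcal{L}_2,\mathcal{D}_2}$ $(y \in \{t, f, \top, \bot\})$.

b) $\nu_1$ and $\nu_2$ are similar if for every atom $p$, $\nu_1(p)$ and $\nu_2(p)$ are similar.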



Proposition 2. Let $(\mathcal{L}_1, \mathcal{D}_1)$ and
$(\mathcal{L}_2, \mathcal{D}_2)$ be two logical lattices and suppose that
$\nu_1$ and $\nu_2$ are two similar valuations on $L_1$ and $L_2$
(respectively). Then for every formula $\psi$, $\nu_1(\psi)$ and
$\nu_2(\psi)$ are similar.

Proof. By an induction on the structure of $\psi$.⁸

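Proof (of Theorem 1). We shall denote by $m_x$ some element in
$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x$
$(x \in \{t, f, \top, \bot\})$, and by $\omega : L \to \{t, f, \top, \bot\}$
the “categorization” function: $\omega(y) = x$ iff $y \in \mathcal{T}_x$.
Also, in the rest of this proof we shall abbreviate
$[y] \cap \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$ by $[y]$ (thus we shall
refer here to classes that consist only of elements in
$\Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$).

Lemma 1. If $M \in\ !(\Gamma, \leq_c^{\mathcal{L},\mathcal{D}})$ then for
every atom $p$, $M(p) \in \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$.

Proof. Suppose that there is some atom $p_0$ s.t.
$M(p_0) \notin \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$. Then, assuming that
$M(p_0) \in \mathcal{T}_x$, there is an element
$m_x \in \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x$ s.t.
$m_x <_c^{\mathcal{L},\mathcal{D}} M(p_0)$. Consider the following valuation:

$$N(p) = \begin{cases} m_x & \text{if } p = p_0 \\ M(p) & \text{if } p \neq p_0 \end{cases}$$

$N$ is similar to $M$, and so, by Proposition 2, $N$ is also a model of
$\Gamma$. Moreover, $N <_c^{\mathcal{L},\mathcal{D}} M$, thus
$M \notin\ !(\Gamma, \leq_c^{\mathcal{L},\mathcal{D}})$. □

⁷ The notion “stopperdness” is due to Makinson [31]. In [27] the same property
is called smoothness.

⁸ Note that the fact that $\mathcal{D}$ is a prime filter is crucial here.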

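Now, since $\leq_c^{\mathcal{L},\mathcal{D}}$ is well-founded and since
$\mathcal{T}_x$ is nonempty for every $x \in \{t, f, \top, \bot\}$,
$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x$ is nonempty as well,
and so there is at least one element of the form $m_x$ for every
$x \in \{t, f, \top, \bot\}$. Also, it is clear that for every
$m_x, m'_x \in \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x$,
$[m_x] = [m'_x]$ (otherwise either $m_x <_c^{\mathcal{L},\mathcal{D}} m'_x$ or
$m_x >_c^{\mathcal{L},\mathcal{D}} m'_x$, and so either
$m'_x \notin \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x$ or
$m_x \notin \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_x$). It
follows, therefore, that there are no more than three equivalence classes in
$\Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$:

$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_t \cup \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_f \subseteq [t]$, $\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\bot \subseteq [m_\bot]$, $\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\top \subseteq [m_\top]$,

where $m_\bot$ is some element of
$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\bot$, and $m_\top$ is
some element of $\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\top$.
By Definition 4, $[t]$ must be a minimal inconsistency class among those in
$\Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$, and $[m_\top]$ must be a maximal
one. It follows, then, that the inconsistency classes in
$\Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$ are arranged in one of the
following orders:

0. $[t] = [m_\bot] = [m_\top]$,
1. $[t] = [m_\bot] <_c^{\mathcal{L},\mathcal{D}} [m_\top]$,
2. $[t] <_c^{\mathcal{L},\mathcal{D}} [m_\bot] = [m_\top]$,
3. $[t] <_c^{\mathcal{L},\mathcal{D}} [m_\bot] <_c^{\mathcal{L},\mathcal{D}} [m_\top]$.

If the order relation among the inconsistency classes in
$\Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$ corresponds to case $i$ above
($0 \leq i \leq 3$) we say that the inconsistency order
$\leq_c^{\mathcal{L},\mathcal{D}}$ is of type $i$.⁹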

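Lemma 2. If $\leq_c^{\mathcal{L},\mathcal{D}}$ is an inconsistency order of
type $i$, then for every $m, m' \in \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$,
$[m] \leq_c^{\mathcal{L},\mathcal{D}} [m']$ iff
$[\omega(m)] \leq_{c_i}^4 [\omega(m')]$.

Proof. Immediate from the definition of inconsistency order of type $i$, and
the definition of $\leq_{c_i}^4$. □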



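Lemma 3. If $\leq_c^{\mathcal{L},\mathcal{D}}$ is an inconsistency order of
type $i$ in $(\mathcal{L}, \mathcal{D})$, then
$\models_c^{\mathcal{L},\mathcal{D}}$ is the same as $\models_{c_i}^4$.

Proof. Suppose that $\Gamma \models_c^{\mathcal{L},\mathcal{D}} \psi$ but
$\Gamma \not\models_{c_i}^4 \psi$. Then there is a $c_i$-mcm $M^4$ of $\Gamma$
s.t. $M^4(\psi) \notin \{t, \top\}$. Now, for every atom $p$ let
$M^{\mathcal{L}}(p)$ be some element in
$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_{M^4(p)}$. Thus
$\omega \circ M^{\mathcal{L}} = M^4$, and $M^{\mathcal{L}}$ is similar to
$M^4$. By Proposition 2, $M^{\mathcal{L}}$ is a model of $\Gamma$ and it is
not a model of $\psi$. To get a contradiction to
$\Gamma \models_c^{\mathcal{L},\mathcal{D}} \psi$, it remains to show, then,
that $M^{\mathcal{L}}$ is a c-mcm of $\Gamma$ in $(\mathcal{L}, \mathcal{D})$.
Indeed, otherwise by stopperdness there is a c-mcm $N^{\mathcal{L}}$ of
$\Gamma$ s.t. $N^{\mathcal{L}} <_c^{\mathcal{L},\mathcal{D}} M^{\mathcal{L}}$.
So for every atom $p$,
$[N^{\mathcal{L}}(p)] \leq_c^{\mathcal{L},\mathcal{D}} [M^{\mathcal{L}}(p)]$,
and there is an atom $p_0$ s.t.
$[N^{\mathcal{L}}(p_0)] <_c^{\mathcal{L},\mathcal{D}} [M^{\mathcal{L}}(p_0)]$.

Let $N^4 = \omega \circ N^{\mathcal{L}}$. Again, $N^4$ is similar to
$N^{\mathcal{L}}$, therefore it is a (four-valued) model of $\Gamma$. Also, by
the definition of $M^{\mathcal{L}}$, for every atom $p$,
$M^{\mathcal{L}}(p) \in \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$, and by
Lemma 1, $\forall p\ N^{\mathcal{L}}(p) \in \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$.
Thus, by Lemma 2,

$[N^4(p)] = [\omega \circ N^{\mathcal{L}}(p)] \leq_{c_i}^4 [\omega \circ M^{\mathcal{L}}(p)] = [M^4(p)]$.

Also, by the same lemma,

$[N^4(p_0)] = [\omega \circ N^{\mathcal{L}}(p_0)] <_{c_i}^4 [\omega \circ M^{\mathcal{L}}(p_0)] = [M^4(p_0)]$.

It follows that $N^4 <_{c_i}^4 M^4$, but this contradicts the assumption that
$M^4$ is a $c_i$-mcm of $\Gamma$.

For the converse, suppose that $\Gamma \models_{c_i}^4 \psi$, but
$\Gamma \not\models_c^{\mathcal{L},\mathcal{D}} \psi$. Then there is a c-mcm
$M^{\mathcal{L}}$ of $\Gamma$ in $(\mathcal{L}, \mathcal{D})$ s.t.
$M^{\mathcal{L}}(\psi) \notin \mathcal{D}$. Define, for every atom $p$,
$M^4(p) = \omega \circ M^{\mathcal{L}}(p)$. By the definition of $\omega$,
$M^4$ is similar to $M^{\mathcal{L}}$ and so $M^4$ is a model of $\Gamma$ in
$\mathcal{FOUR}$, but it is not a model of $\psi$. It remains to show, then,
that $M^4$ is a $c_i$-mcm of $\Gamma$. Indeed, otherwise there is a model
$N^4$ of $\Gamma$ s.t. $N^4 <_{c_i}^4 M^4$, that is, for every atom $p$,
$[N^4(p)] \leq_{c_i}^4 [M^4(p)]$, and there is an atom $p_0$ for which this
inequality is strict: $[N^4(p_0)] <_{c_i}^4 [M^4(p_0)]$. Now, for every atom
$p$, let $N^{\mathcal{L}}(p)$ be some element in
$\min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_{N^4(p)}$. Thus
$\omega \circ N^{\mathcal{L}} = N^4$, and $N^{\mathcal{L}}$ is similar to
$N^4$. By Proposition 2, $N^{\mathcal{L}}$ is in particular a model of
$\Gamma$ in $(\mathcal{L}, \mathcal{D})$. Moreover, for every atom $p$,

$[\omega \circ N^{\mathcal{L}}(p)] = [N^4(p)] \leq_{c_i}^4 [M^4(p)] = [\omega \circ M^{\mathcal{L}}(p)]$.

Now, by the definition of $N^{\mathcal{L}}$ we have that for every atom $p$,
$N^{\mathcal{L}}(p) \in \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$, and by
Lemma 1, $M^{\mathcal{L}}(p) \in \Omega_{\leq_c^{\mathcal{L},\mathcal{D}}}$ as
well. Hence, by Lemma 2,
$[N^{\mathcal{L}}(p)] \leq_c^{\mathcal{L},\mathcal{D}} [M^{\mathcal{L}}(p)]$.
Similarly,

$[\omega \circ N^{\mathcal{L}}(p_0)] = [N^4(p_0)] <_{c_i}^4 [M^4(p_0)] = [\omega \circ M^{\mathcal{L}}(p_0)]$

and again this entails that
$[N^{\mathcal{L}}(p_0)] <_c^{\mathcal{L},\mathcal{D}} [M^{\mathcal{L}}(p_0)]$.
It follows that $N^{\mathcal{L}} <_c^{\mathcal{L},\mathcal{D}} M^{\mathcal{L}}$,
but this contradicts the assumption that $M^{\mathcal{L}}$ is a c-mcm of
$\Gamma$ in $(\mathcal{L}, \mathcal{D})$.

This concludes the proof of Lemma 3 and Theorem 1. □

⁹ In particular, for every $0 \leq i \leq 3$, the inconsistency order
$\leq_{c_i}^4$ in $\mathcal{FOUR}$ is of type $i$.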

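Note 1. The proof of Theorem 1 also induces a simple algorithm for determining
which one of the four-valued consequence relations is the same as a given
consequence relation of the form $\models_c^{\mathcal{L},\mathcal{D}}$: given
an inconsistency order $\leq_c^{\mathcal{L},\mathcal{D}}$ in
$(\mathcal{L}, \mathcal{D})$, choose some
$m_\bot \in \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\bot$ and
$m_\top \in \min_{\leq_c^{\mathcal{L},\mathcal{D}}} \mathcal{T}_\top$. If
$[m_\top] = [t]$ then $\models_c^{\mathcal{L},\mathcal{D}} = \models_{c_0}^4$.
Otherwise, if $[m_\bot] = [t]$, then
$\models_c^{\mathcal{L},\mathcal{D}} = \models_{c_1}^4$. Otherwise, if
$[m_\top] = [m_\bot]$, then
$\models_c^{\mathcal{L},\mathcal{D}} = \models_{c_2}^4$. Otherwise,
$\models_c^{\mathcal{L},\mathcal{D}} = \models_{c_3}^4$.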
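The algorithm of Note 1 is short enough to be written down directly. In the sketch below an inconsistency order is assumed to be given by a rank table (elements of equal rank are incomparable, hence equivalent), and the categorization function $\omega$ by a second table; these encodings and all names are our assumptions.

```python
def order_type(rank, category):
    """Return the type i (0..3) of an inconsistency order.

    rank:     maps each lattice element to its level in the modular order.
    category: maps each element to 't', 'f', 'top' or 'bot' (the function omega).
    """
    def min_rank(cat):
        # Rank of the minimal elements of the corresponding set T_cat.
        return min(rank[x] for x in rank if category[x] == cat)

    r_t, r_bot, r_top = min_rank('t'), min_rank('bot'), min_rank('top')
    if r_top == r_t:      # [m_top] = [t]
        return 0
    if r_bot == r_t:      # [m_bot] = [t]
        return 1
    if r_top == r_bot:    # [m_top] = [m_bot]
        return 2
    return 3

# Sanity check on FOUR itself, where omega is the identity.
FOUR_CATEGORY = {'t': 't', 'f': 'f', 'top': 'top', 'bot': 'bot'}
FOUR_RANKS = [
    {'t': 0, 'f': 0, 'bot': 0, 'top': 0},
    {'t': 0, 'f': 0, 'bot': 0, 'top': 1},
    {'t': 0, 'f': 0, 'bot': 1, 'top': 1},
    {'t': 0, 'f': 0, 'bot': 1, 'top': 2},
]
types = [order_type(r, FOUR_CATEGORY) for r in FOUR_RANKS]
```

On the four orders of Example 3 the function returns 0, 1, 2 and 3 respectively, as expected from the remark that each of these orders is of its own type.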

4 Reasoning with $\models_c^{\mathcal{L},\mathcal{D}}$

We conclude by briefly considering some basic properties of
$\models_c^{\mathcal{L},\mathcal{D}}$.¹⁰ In what follows we assume
stopperdness, and so, by Theorem 1, it is sufficient to consider
$\mathcal{FOUR}$ and the four corresponding consequence relations
$\models_{c_i}^4$ ($i = 0, \ldots, 3$). First, we consider the relative
strength of these logics:

¹⁰ Most of the propositions in this section easily follow from similar results
concerning modular preferential relations, considered in [1]. Due to space
limitations, corresponding proofs are omitted.

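Proposition 3. Let $\Gamma$ be a set of formulae and $\psi$ a formula in $\Sigma$.

a) The consequence relations $\models_{c_i}^4$, $0 \leq i \leq 3$, are all different.

b) For every $1 \leq i \leq 3$, if $\Gamma \models_{c_0}^4 \psi$ then $\Gamma \models_{c_i}^4 \psi$.

c) No one of $\models_{c_1}^4$, $\models_{c_2}^4$, $\models_{c_3}^4$ is stronger than the other.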

In what follows we shall write $\models^2$ for the classical consequence
relation, and $\models^4$ for any one of $\models_{c_i}^4$,
$0 \leq i \leq 3$. As the next proposition shows, reasoning with $\models^4$
does not reduce to triviality when the set of premises is not consistent.

Proposition 4. $\models^4$ is paraconsistent.



Proposition 5. If $\Gamma \models^4 \psi$ then $\Gamma \models^2 \psi$.

The converse of Proposition 5 is not true in general. For instance, excluded
middle is not valid w.r.t. $\models_{c_0}^4$ and $\models_{c_1}^4$. However,
with respect to the other basic four-valued consequence relations, the
converse of Proposition 5 does hold.

Proposition 6. Let $\Gamma$ be a classically consistent theory. Then for every
formula $\psi$ in $\Sigma$ we have that $\Gamma \models^2 \psi$ iff
$\Gamma \models_{c_2}^4 \psi$ iff $\Gamma \models_{c_3}^4 \psi$.

By Propositions 4 and 6, it follows that with (any consequence relation of
the form $\models_c^{\mathcal{L},\mathcal{D}}$ that is equivalent to)
$\models_{c_2}^4$ and $\models_{c_3}^4$ one can draw classical conclusions
from (classically) consistent theories, while the set of conclusions is not
“exploded” when the theory becomes inconsistent. Batens [10] describes this
property as an “oscillation” between some lower limit (paraconsistent) logic
and an upper limit (classical) logic.

Proposition 7. $\models_{c_0}^4$ is a monotonic consequence relation, while
$\models_{c_i}^4$, $i = 1, 2, 3$, are nonmonotonic relations.

The last proposition implies that unless the inconsistency order is
degenerated, $\models_c^{\mathcal{L},\mathcal{D}}$ is not monotonic, thus it
is not a consequence relation in the sense of Tarski [38]. In such cases it is
usual to require a weaker condition:

Proposition 8. [24,27] $\models^4$ satisfies cautious left monotonicity: if
$\Gamma \models^4 \psi$ and $\Gamma \models^4 \phi$, then
$\Gamma, \psi \models^4 \phi$.

A desirable property of non-monotonic consequence relations is the ability to 
preserve any conclusion when learning about a new fact that has no influence on 
the set of premises. Consequence relations that satisfy this property are called 
rational [28]. The next proposition shows that $\models_{c_i}^4$
($i = 0, \ldots, 3$) are rational.

Proposition 9. If $\Gamma \models^4 \psi$ and
$\mathcal{A}(\Gamma \cup \{\psi\}) \cap \mathcal{A}(\phi) = \emptyset$, then
$\Gamma, \phi \models^4 \psi$.¹¹

¹¹ Recall that $\mathcal{A}(\Gamma)$ is the set of atomic formulae that appear
in some formula of $\Gamma$.

Intuitively, the second condition in Proposition 9 guarantees that $\phi$ is
‘irrelevant’ for $\Gamma$ and $\psi$. The intuitive meaning of Proposition 9
is, therefore, that the reasoner does not have to retract $\psi$ when learning
that $\phi$ holds.

Note 2. In order to assure rationality, Lehmann and Magidor [28] introduced
the rule of rational monotonicity: if $\Gamma \mathrel{|\!\sim} \psi$ then
$\Gamma, \phi \mathrel{|\!\sim} \psi$, unless
$\Gamma \mathrel{|\!\sim} \neg\phi$.

Rational monotonicity may be considered as too strong for assuring
rationality, and many general patterns of nonmonotonic reasoning do not
satisfy this rule. For instance, $\models_{c_1}^4$ is rational (Proposition
9), but it does not satisfy rational monotonicity (consider, e.g.,
$\Gamma = \{p, q \to \neg p\}$, $\psi = \neg p \to \neg q$, and $\phi = q$).

In terms of Batens [10,11], $\models_{c_2}^4$ and $\models_{c_3}^4$ are also
adaptive, i.e.: if it is possible to distinguish between a consistent part and
an inconsistent part of a given theory, then every assertion that classically
follows from the consistent part, and is not related to the inconsistent part,
is also a $\models_{c_i}^4$-consequence ($i = 2, 3$) of the whole theory.
Thus, as the following proposition shows, $\models_{c_2}^4$ and
$\models_{c_3}^4$ presuppose a consistency of all the assertions ‘unless and
until proven otherwise’.

Proposition 10. Let $\Gamma = \Gamma' \cup \Gamma''$ be a set of formulae in
$\Sigma$ s.t. $\Gamma'$ is classically consistent and
$\mathcal{A}(\Gamma') \cap \mathcal{A}(\Gamma'') = \emptyset$. Then for every
$\psi$ s.t. $\mathcal{A}(\psi) \cap \mathcal{A}(\Gamma'') = \emptyset$, the
fact that $\Gamma' \models^2 \psi$ entails that $\Gamma \models_{c_2}^4 \psi$
and $\Gamma \models_{c_3}^4 \psi$.

We conclude by noting that consequence relations of the form
$\models_c^{\mathcal{L},\mathcal{D}}$ naturally generalize some related
formalisms, such as the consequence relations introduced in [5,6] and the
logic $LP_m$ of Priest [34].

Abstract. Through minimal-model semantics, three-valued logics provide an
interesting formalism for capturing reasoning from inconsistent information.
However, the resulting paraconsistent logics so far lack a uniform implementation
platform. Here, we address this and specifically provide a translation of two such 
paraconsistent logics into the language of quantified Boolean formulas (QBFs). 
These formulas can then be evaluated by off-the-shelf QBF solvers. In this way, 
we benefit from the following advantages: First, our approach allows us to har- 
ness the performance of existing QBF solvers. Second, different paraconsistent 
logics can be compared within a unified setting via the translations used. We
alternatively provide a translation of these two paraconsistent logics into quanti- 
fied Boolean formulas representing circumscription, the well-known system for 
logical minimization. All this forms a case study inasmuch as the other exist- 
ing minimization-based many-valued paraconsistent logics can be dealt with in a 
similar fashion. 



1 Introduction 

The capability of reasoning in the presence of inconsistencies constitutes a major chal- 
lenge for any intelligent system because in practical settings it is common to have 
contradictory information. In fact, despite its many appealing features for knowledge 
representation and reasoning, classical logic falls in a trap: A single contradiction may 
wreck an entire reasoning system, since it allows for deriving any proposition. This com- 
portment is due to the fact that a contradiction denies any classical two-valued model, 
since a proposition must be either true or false. We thus aim at providing
formal reasoning systems satisfying the principle of paraconsistency:
$\{\alpha, \neg\alpha\} \nvdash \beta$ for some $\alpha, \beta$. In other
words, given a contradictory set of premises, this should not necessarily lead
to concluding all formulas.

The idea underlying the approaches elaborated upon in this paper is to counterbalance
the effect of contradictions by providing a third truth value that accounts for contradictory 
propositions. As already put forth in [27], this provides us with inconsistency-tolerating 
three-valued models. However, this approach turns out to be rather weak in that it in- 
validates certain classical inferences, even if there is no contradiction. Intuitively, this 
is because there are too many three-valued models, in particular those assigning the 
inconsistency-tolerating truth-value to propositions that are unaffected by contradictions.
For instance, the three-valued logic LP [27] denies inference by disjunctive syllogism.
That is, $\beta$ is not derivable from the (consistent!) premise $(\alpha \vee \beta) \wedge \neg\alpha$. As pointed out
in [15], this deficiency also applies to the closely related paraconsistent systems $J_3$ [17],
L [22], and RP [19]. As a consequence, none of the aforementioned systems coincides
with classical logic when reasoning from consistent premises.

The pioneering work to overcome this deficiency was done by Priest in [28]. The 
key idea is to restrict the set of three-valued models by taking advantage of some pref- 
erence criterion that aims at “minimizing inconsistency”. In this way, a “maximum” of 
a classically inconsistent knowledge base should be recovered. While minimization is 
understood in Priest’s seminal work [28], proposing his logic LPm, as preferring three- 
valued models as close as possible to two-valued interpretations, the overall approach 
leaves room for different preference criteria. Another criterion is put forth in [9] by giving 
more importance to the given knowledge base. In this approach, one prefers three-valued
models that are as similar as possible to two-valued models of the knowledge base in 
the sense that those models assign true to as many items of the knowledge base as pos- 
sible. Furthermore, [21] considers cardinality-based versions of the last two preference 
criteria. Even more criteria are conceivable by distinguishing symbols having different 
importance. 

However, up to now, all these advanced approaches lack effectively implementable
inference methods. While Priest defines LPm in purely semantical terms, a Hilbert cal- 
culus comprising 26 axiom schemata is proposed by Besnard and Schaub [9] for ax- 
iomatizing their approach. Also, inference is not at issue in [21]. This shortcoming is 
addressed in this paper. To wit, we develop translations for the three-valued
paraconsistent logics defined in [28] and [9]. More precisely, our translations allow for mapping
the respective entailment problems into the satisfiability problem for quantified Boolean 
formulas (QBFs). These formulas can then be evaluated by off-the-shelf QBF solvers. 
The motivation of this particular approach to implementing these logics (as opposed to 
more direct calculizations) stems from its unique uniformity, even beyond the framework 
of three-valued logics. In fact, we have already developed in a companion paper [11] 
similar translations for a rather different family of paraconsistent logics, called signed 
systems [10]; a forthcoming paper deals with approaches to paraconsistency based on
the selection of maximally consistent subsets [24,8]. 

Our general methodology offers several benefits: First, we obtain uniform axiomati- 
zations of rather different approaches. This allows us to compare different paraconsistent 
logics in a unified setting. Second, once such an axiomatization is available, existing QBF 
solvers can be used for implementation in a uniform manner. The availability of efficient 
QBF solvers, like the systems described in [12,20,6], makes such a rapid prototyping 
approach practicably applicable. Third, these axiomatizations provide a direct access to
the complexity of the original approach. Conversely, we can exploit existing complexity
results for ensuring the adequateness of our axiomatizations. Finally, we remark that this 
approach allows us, in some sense, to express paraconsistent reasoning in (higher order) 
classical propositional logic and so to harness classical reasoning mechanisms from (a 
conservative extension of) propositional logic. 



2 Paraconsistent Three-Valued Logics



We deal with a language $\mathcal{L}$ over a set $\mathcal{V}$ of propositional
variables and use the logical symbols $\top$, $\bot$, $\neg$, $\vee$,
$\wedge$, and $\to$ to construct formulas in the standard way. Formulas are
denoted by Greek lower-case letters (possibly with subscripts).

An interpretation is a function $v : \mathcal{V} \to \{t, f, o\}$ extending to
$\hat{v} : \mathcal{L} \to \{t, f, o\}$ according to the truth tables below.





( 1 ) 



We sometimes leave an interpretation $v$ implicit and write $p : x$ instead of
$v(p) = x$, for $x \in \{t, f, o\}$. An interpretation $v$ is said to be
two-valued whenever $v(p) \in \{t, f\}$ for all $p \in \mathcal{V}$;
otherwise, it is three-valued. A three-valued model of a formula $\alpha$ is
an interpretation that assigns either $t$ or $o$ to $\alpha$. Modelhood
extends to sets of formulas in the standard way. As usual, given a set $S$ of
formulas and a formula $\varphi$, we define $S \models \varphi$ if each model
of $S$ is a model of $\varphi$. Whenever necessary, we write $\models_3$ and
$\models_2$ to distinguish three-valued from two-valued entailment.

Note that the truth value of $\alpha \to \beta$ differs from that of
$\neg\alpha \vee \beta$ only in the case of $v = \{\alpha : o, \beta : f\}$,
resulting in $v(\alpha \to \beta) = f$ and $v(\neg\alpha \vee \beta) = o$.
This difference is prompted by the fact that $t$ and $o$ indicate modelhood,
which motivates the assignment of the same truth values to $\alpha \to \beta$
no matter whether we have $\alpha : t$ or $\alpha : o$. This has actually to
do with the difference between modus ponens and disjunctive syllogism: The
latter yields $\beta$ from $\alpha \wedge \neg\alpha \wedge \neg\beta$ because
$\alpha \vee \beta$ follows from $\alpha$. The overall inference seems wrong
because in the presence of $\alpha \wedge \neg\alpha$, $\alpha \vee \beta$ is
satisfied (by $\alpha : o$) with no need for $\beta$ to be $t$. As pointed out
in [21], one may actually view $\to$ as “the ‘right’ generalization of
classical implication because $\to$ is the internal implication connective [5]
for the defined inference relation in the sense that a deduction (meta)theorem
holds for it: $S, \alpha \models_3 \beta$ iff $S \models_3 \alpha \to \beta$.”
On the other hand, a formula composed of the connectives $\neg$, $\vee$, and
$\wedge$ can never be inconsistent; that is, each such formula has at least
one three-valued model [13]. Finally, we mention that the entailment problem
for $\models_3$ is coNP-complete, no matter whether $\to$ is included or
not [26,13,15].
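The claim about the single point of difference can be checked by enumeration. The sketch below assumes the usual three-valued tables (negation swaps $t$ and $f$ and fixes $o$; $\vee$ and $\wedge$ are maximum and minimum w.r.t. $f < o < t$) and defines $\to$ as characterized in the paragraph above; the function names are ours.

```python
from itertools import product

ORDER = {'f': 0, 'o': 1, 't': 2}   # assumed truth ordering f < o < t
VALUES = ['t', 'f', 'o']

def neg(a):
    return {'t': 'f', 'f': 't', 'o': 'o'}[a]

def disj(a, b):
    return max(a, b, key=ORDER.get)

def conj(a, b):
    return min(a, b, key=ORDER.get)

def impl(a, b):
    # Same result for a = t and a = o, since both indicate modelhood.
    return b if a in ('t', 'o') else 't'

# All argument pairs on which a -> b and (not a) or b disagree.
differences = [(a, b) for a, b in product(VALUES, repeat=2)
               if impl(a, b) != disj(neg(a), b)]

# Assigning o everywhere satisfies any formula over neg, disj, conj.
all_o = conj(disj('o', neg('o')), neg('o'))
```

The list `differences` contains exactly the pair `('o', 'f')`, where `impl` yields `'f'` and the disjunction yields `'o'`.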

As mentioned in the introductory section, Priest’s logic $LP_m$ [28] was
conceived to overcome the failure of disjunctive syllogism in LP [27]. LP
amounts to the three-valued logic obtained by restricting $\mathcal{L}$ to the
connectives $\neg$, $\vee$ and $\wedge$ (and defining $\alpha \to \beta$ as
$\neg\alpha \vee \beta$). In $LP_m$, modelhood is then limited to models
containing a minimal number of propositional variables being assigned $o$.
This allows for drawing “all classical inferences except where inconsistency
makes them doubtful anyway” [28]. Formally, the consequence relation of $LP_m$
can be defined as follows. For three-valued interpretations $v, w$, define the
partial ordering

$v \leq_m w$ iff $\{p \in \mathcal{V} \mid v(p) = o\} \subseteq \{p \in \mathcal{V} \mid w(p) = o\}$.

Then, $T \models_m \varphi$ iff every three-valued model of $T$ that is
minimal with respect to $\leq_m$ is a three-valued model of $\varphi$.

Unlike this, the approach of Besnard and Schaub [9] prefers three-valued
models that assign true to as many items of the knowledge base $T$ as
possible: For three-valued interpretations $v, w$, define the partial ordering

$v \leq_n w$ iff $\{\varphi \in T \mid v(\varphi) = o\} \subseteq \{\varphi \in T \mid w(\varphi) = o\}$.

Then, $T \models_n \varphi$ iff each three-valued model of $T$ which is
$\leq_n$-minimal is a three-valued model of $\varphi$.

The major difference between the last two approaches is that the restriction of model- 
hood in LPm focuses on models as close as possible to two-valued interpretations, while 
the one in the last approach aims at models next to two-valued models of the considered 
premises. According to [9], the effects of making the formula select its
preferred models can be seen by looking at
$T = \{p, \neg p, (\neg p \vee q)\}$: While $LP_m$ yields two
$\leq_m$-preferred models, $\{p : o, q : t\}$ and $\{p : o, q : f\}$, from
which one obtains $p \wedge \neg p$, the second approach yields $q$ as
additional conclusion. In fact, $\{p : o, q : t\}$ is the only
$\leq_n$-preferred model of the premises $\{p, \neg p, (\neg p \vee q)\}$; it
assigns $t$ to $(\neg p \vee q)$, while this premise is attributed $o$ by the
second $\leq_m$-preferred model $\{p : o, q : f\}$; hence the latter is not
$\leq_n$-preferred. So, while $T \not\models_m q$ and $T \models_n q$, we note
that $T \cup \{(p \vee \neg q)\} \not\models_i q$ for $i = m, n$. On the other
hand, $\models_n$ is clearly more syntax-dependent than $\models_m$ since the
items within the knowledge base are used for distinguishing $\leq_n$-preferred
models.
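The example can be reproduced by enumerating three-valued interpretations. The following sketch is a naive illustration of the two preference criteria, not the QBF-based method developed in this paper; it assumes the usual LP tables for $\neg$, $\vee$, $\wedge$ and a tuple-based formula syntax of our own.

```python
from itertools import product

ORDER = {'f': 0, 'o': 1, 't': 2}   # assumed truth ordering f < o < t
NEG = {'t': 'f', 'f': 't', 'o': 'o'}

def ev(phi, v):
    """Formulas: an atom name, ('not', a), ('or', a, b) or ('and', a, b)."""
    if isinstance(phi, str):
        return v[phi]
    if phi[0] == 'not':
        return NEG[ev(phi[1], v)]
    pick = max if phi[0] == 'or' else min
    return pick(ev(phi[1], v), ev(phi[2], v), key=ORDER.get)

def models(theory, atoms):
    """Three-valued models: every premise receives t or o."""
    out = []
    for vals in product('tfo', repeat=len(atoms)):
        v = dict(zip(atoms, vals))
        if all(ev(g, v) != 'f' for g in theory):
            out.append(v)
    return out

def preferred(theory, atoms, key):
    """Models whose key set is minimal w.r.t. proper set inclusion."""
    ms = models(theory, atoms)
    return [m for m in ms if not any(key(n) < key(m) for n in ms)]

def entails(theory, atoms, kind, phi):
    if kind == 'm':   # compare the sets of atoms assigned o
        key = lambda v: frozenset(p for p in atoms if v[p] == 'o')
    else:             # compare the sets of premises assigned o
        key = lambda v: frozenset(i for i, g in enumerate(theory)
                                  if ev(g, v) == 'o')
    return all(ev(phi, m) != 'f' for m in preferred(theory, atoms, key))

atoms = ['p', 'q']
T = ['p', ('not', 'p'), ('or', ('not', 'p'), 'q')]
T_ext = T + [('or', 'p', ('not', 'q'))]
```

With these definitions, `entails(T, atoms, 'm', 'q')` is `False` and `entails(T, atoms, 'n', 'q')` is `True`, while both calls return `False` for `T_ext`.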

In fact, both inference relations $\models_m$ and $\models_n$ amount to their
classical (two-valued) counterpart whenever the set of premises is classically
consistent. Also, it is shown in [15] that deciding entailment for $\models_m$
and $\models_n$ is $\Pi_2^P$-complete, no matter whether $\to$ is included or
not. A logical analysis of both relations can be found in [21] and in the
original literature [28,9].

3 Axiomatizing Three-Valued Paraconsistent Logics

In what follows, we provide axiomatizations of the three-valued paraconsistent logics 
introduced in the last section in terms of QBFs. 

Quantified Boolean Formulas. As a conservative extension of classical propositional
logic, quantified Boolean formulas (QBFs) generalize ordinary propositional formulas
by the admission of quantifications over propositional variables (QBFs are denoted by
Greek upper-case letters). Informally, a QBF of form $\forall p \exists q\, \Phi$ means that for
all truth assignments of $p$ there is a truth assignment of $q$ such that $\Phi$ is true.
Given that $\mathcal{L}$ is the language of QBFs over a set $\mathcal{P}$ of propositional
variables, the semantical meaning of QBFs can be defined as follows: An interpretation
is a function $v : \mathcal{P} \to \{t, f\}$ extending to $v : \mathcal{L} \to \{t, f\}$ according to
the truth tables in (1) and the following two conditions, for every $\Phi \in \mathcal{L}$,

$$v(\forall p\, \Phi) = v(\Phi[p/\top] \land \Phi[p/\bot]) \quad\text{and}\quad v(\exists p\, \Phi) = v(\Phi[p/\top] \lor \Phi[p/\bot]).$$

We write $\Phi[p_1/\varphi_1, \ldots, p_n/\varphi_n]$ to denote the result of uniformly
substituting each free occurrence^1 of a variable $p_i$ in $\Phi$ by a formula $\varphi_i$, for
$1 \le i \le n$. If $\Phi$ contains no free variable occurrences, then $\Phi$ is closed.
Closed QBFs are either true under every interpretation or false under every
interpretation. Hence, for closed QBFs there is no need to refer to particular
interpretations.
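The two quantifier conditions translate directly into a recursive evaluator. The following is a minimal sketch, assuming formulas are encoded as nested tuples such as `("forall", "p", body)`; this encoding and the function name `ev` are illustrative choices, not something the paper prescribes.

```python
def ev(phi, v):
    """Evaluate a QBF (nested tuples) under assignment v (dict: atom -> bool)."""
    op = phi[0]
    if op == "var":
        return v[phi[1]]
    if op == "not":
        return not ev(phi[1], v)
    if op == "and":
        return ev(phi[1], v) and ev(phi[2], v)
    if op == "or":
        return ev(phi[1], v) or ev(phi[2], v)
    if op == "imp":
        return (not ev(phi[1], v)) or ev(phi[2], v)
    if op in ("forall", "exists"):
        _, p, body = phi
        # Corresponds to Phi[p/true] and Phi[p/false] in the two conditions.
        branches = [ev(body, {**v, p: b}) for b in (True, False)]
        return all(branches) if op == "forall" else any(branches)
    raise ValueError(f"unknown connective: {op}")


P, Q = ("var", "p"), ("var", "q")
IFF = ("and", ("imp", P, Q), ("imp", Q, P))
forall_exists = ("forall", "p", ("exists", "q", IFF))   # closed QBF
exists_forall = ("exists", "q", ("forall", "p", IFF))   # closed QBF
```

Because both example formulas are closed, they can be evaluated under the empty assignment: `ev(forall_exists, {})` is true and `ev(exists_forall, {})` is false. The evaluator expands both branches of every quantifier, so it is exponential in the number of quantified variables and is meant for checking small examples only.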

In the sequel, we use the following abbreviations: The set of all atoms occurring
in a formula $\varphi$ is denoted by $var(\varphi)$. Similarly, for a set $S$ of formulas,
$var(S) = \bigcup_{\varphi \in S} var(\varphi)$. For a set $P = \{p_1, \ldots, p_n\}$ of propositional
variables and a quantifier $Q \in \{\forall, \exists\}$, we let $QP\, \Phi$ stand for the formula
$Qp_1 Qp_2 \cdots Qp_n\, \Phi$. Furthermore, for indexed sets $S = \{\varphi_1, \ldots, \varphi_n\}$ and
$T = \{\psi_1, \ldots, \psi_n\}$ of formulas, $S \le T$ abbreviates
$\bigwedge_{i=1}^{n} (\varphi_i \to \psi_i)$, and $S < T$ stands for $(S \le T) \land \neg(T \le S)$.



Encoding Three-Valued Logic. We start with encoding the truth evaluation of the
three-valued logic given in Sect. 2 by means of classical propositional logic.

To this end, we introduce for each atom $p$ a globally new atom $p'$ and define
$P' = \{p' \mid p \in P\}$ for a given alphabet $P$.

Let $v$ be a three-valued interpretation over alphabet $P$. We define the associated
two-valued interpretation $v_2$ by setting

$v_2(p) = v_2(p') = t$ if $v(p) = t$;

$v_2(p) = v_2(p') = f$ if $v(p) = f$;

$v_2(p) = f$ and $v_2(p') = t$ if $v(p) = o$,

for any $p \in P$ and any $p' \in P'$. Conversely, for a given two-valued interpretation $v$
over alphabet $P \cup P'$ such that $v(p \to p') = t$, we define the associated three-valued
interpretation $v_3$ by setting

$v_3(p) = v(p)$ if $v(p) = v(p')$;

$v_3(p) = o$ if $v(p) = f$ and $v(p') = t$,

for any $p \in P$.

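Moreover, we need the following parameterized translation:

Definition 1. For $p \in P$ and formulas $\varphi, \psi$, we define

1. (a) $\tau(p, t) = p$;
   (b) $\tau(p, f) = \neg p'$;
   (c) $\tau(p, o) = \neg p \land p'$;

2. (a) $\tau(\neg\varphi, t) = \tau(\varphi, f)$;
   (b) $\tau(\neg\varphi, f) = \tau(\varphi, t)$;
   (c) $\tau(\neg\varphi, o) = \tau(\varphi, o)$;

3. (a) $\tau(\varphi \land \psi, t) = \tau(\varphi, t) \land \tau(\psi, t)$;
   (b) $\tau(\varphi \land \psi, f) = \tau(\varphi, f) \lor \tau(\psi, f)$;
   (c) $\tau(\varphi \land \psi, o) = \neg\tau(\varphi \land \psi, f) \land \neg\tau(\varphi \land \psi, t)$;

4. (a) $\tau(\varphi \lor \psi, t) = \tau(\varphi, t) \lor \tau(\psi, t)$;
   (b) $\tau(\varphi \lor \psi, f) = \tau(\varphi, f) \land \tau(\psi, f)$;
   (c) $\tau(\varphi \lor \psi, o) = \neg\tau(\varphi \lor \psi, t) \land \neg\tau(\varphi \lor \psi, f)$;

5. (a) $\tau(\varphi \to \psi, t) = \tau(\varphi, f) \lor \tau(\psi, t)$;
   (b) $\tau(\varphi \to \psi, f) = \neg\tau(\varphi, f) \land \tau(\psi, f)$;
   (c) $\tau(\varphi \to \psi, o) = \neg\tau(\varphi, f) \land \tau(\psi, o)$.

^1 An occurrence of a propositional variable $p$ in a QBF $\Phi$ is free if it does not
appear in the scope of a quantifier $Qp$ ($Q \in \{\forall, \exists\}$).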

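For computing the three-valued models of a set $T = \{\varphi_1, \ldots, \varphi_n\}$ of formulas,
we use $\bigwedge_{\varphi \in T} \neg\tau(\varphi, f)$ and abbreviate the latter by $\neg\tau(T, f)$.^2

For example, consider $T = \{p, \neg p, (\neg p \lor q)\}$. We get:

$$\begin{aligned}
\neg\tau(T, f) &= \neg\tau(p, f) \land \neg\tau(\neg p, f) \land \neg\tau((\neg p \lor q), f) \\
&= \neg\neg p' \land \neg\tau(p, t) \land \neg(\tau(\neg p, f) \land \tau(q, f)) \\
&= p' \land \neg p \land \neg(\tau(p, t) \land \neg q') \\
&= p' \land \neg p \land (\neg p \lor \neg\neg q') \\
&= p' \land \neg p .
\end{aligned}$$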

The resulting formula possesses four two-valued models, all of which assign $p : f$ and
$p' : t$ while varying on $q$ and $q'$. In order to establish a correspondence among the
four two-valued models of $\neg\tau(T, f)$ and the three three-valued models of $T$,
assigning $o$ to $p$ and varying on $q$, the relation between the two alphabets $P$ and $P'$
must be fixed. In fact, this is accomplished by adding $r \to r'$ for every $r \in P$.

In this way, we obtain the following result. 

Theorem 1. Let $\varphi$ be a formula with $P = var(\varphi)$, let $P' = \{p' \mid p \in P\}$, and let
$x \in \{t, f, o\}$.

Then, the following conditions hold:

1. For any three-valued interpretation $v$ over $P$, if $v(\varphi) = x$, then
   $v_2((P \le P') \land \tau(\varphi, x)) = t$, where $v_2$ is the associated two-valued
   interpretation of $v$.

2. For any two-valued interpretation $v$ over $P \cup P'$, if $v((P \le P') \land \tau(\varphi, x)) = t$,
   then $v_3(\varphi) = x$, where $v_3$ is the associated three-valued interpretation of $v$.

Since the formula $\tau(\varphi, t) \lor \tau(\varphi, f) \lor \tau(\varphi, o)$ is clearly a tautology of
classical logic, we immediately get the following relation between the three-valued
models of a theory and the two-valued models of the corresponding encoding:

Corollary 1. Let $T$ be a finite set of formulas with $P = var(T)$ and let
$P' = \{p' \mid p \in P\}$.

Then, there is a one-to-one correspondence between the three-valued models of $T$
and the two-valued models of the formula

$$(P \le P') \land \neg\tau(T, f). \qquad (2)$$

In particular, the three-valued model of $T$ corresponding to a two-valued model $v$
of (2) is given by the associated three-valued interpretation $v_3$ of $v$.

^2 Note that generally $\neg\tau(\varphi, x) \ne \tau(\neg\varphi, x)$.

For illustration, consider $T = \{p, \neg p, (\neg p \lor q)\}$ along with

$$(\{p, q\} \le \{p', q'\}) \land \neg\tau(T, f) = (p \to p') \land (q \to q') \land (p' \land \neg p) .$$

Unlike above, we obtain now three two-valued models, $\{p : f, p' : t, q : t, q' : t\}$,
$\{p : f, p' : t, q : f, q' : t\}$, and $\{p : f, p' : t, q : f, q' : f\}$, being in a one-to-one
correspondence with the three three-valued models, $\{p : o, q : t\}$, $\{p : o, q : o\}$, and
$\{p : o, q : f\}$, of $T$, respectively.
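The model counts in this illustration can be checked mechanically. The sketch below evaluates $\tau(\varphi, x)$ directly under a two-valued assignment over $P \cup P'$, following the clauses for atoms, negation, conjunction and disjunction in Definition 1 (the implication clauses are omitted for brevity). The tuple encoding of formulas and every function name are assumptions made for this example only.

```python
from itertools import product

FLIP = {"t": "f", "f": "t", "o": "o"}


def tau(phi, x, v):
    """Value of tau(phi, x) under a two-valued assignment v over P and P'."""
    op = phi[0]
    if op == "var":
        p, p_prime = v[phi[1]], v[phi[1] + "'"]
        return {"t": p, "f": not p_prime, "o": (not p) and p_prime}[x]
    if op == "not":
        return tau(phi[1], FLIP[x], v)
    if op not in ("and", "or"):
        raise ValueError(f"unsupported connective: {op}")
    a, b = phi[1], phi[2]
    if x == "o":
        return not tau(phi, "t", v) and not tau(phi, "f", v)
    if op == "and":
        return (tau(a, "t", v) and tau(b, "t", v)) if x == "t" \
            else (tau(a, "f", v) or tau(b, "f", v))
    return (tau(a, "t", v) or tau(b, "t", v)) if x == "t" \
        else (tau(a, "f", v) and tau(b, "f", v))


p, q = ("var", "p"), ("var", "q")
THEORY = [p, ("not", p), ("or", ("not", p), q)]
ATOMS = ["p", "p'", "q", "q'"]


def encoding_models(link):
    """Two-valued models of not-tau(T, f), optionally conjoined with P <= P'."""
    found = []
    for bits in product((True, False), repeat=len(ATOMS)):
        v = dict(zip(ATOMS, bits))
        linked = all((not v[a]) or v[a + "'"] for a in ("p", "q"))
        if (linked or not link) and not any(tau(f, "f", v) for f in THEORY):
            found.append(v)
    return found


def associated_three_valued(v):
    """v_3 for a model satisfying P <= P': equal values are kept, else o."""
    return {a: ("t" if v[a] else "f") if v[a] == v[a + "'"] else "o"
            for a in ("p", "q")}


unlinked = encoding_models(link=False)
linked = encoding_models(link=True)
three_valued = [associated_three_valued(v) for v in linked]
```

With these definitions, `unlinked` contains four assignments and `linked` contains three, and `three_valued` lists the interpretations that assign `o` to `p` and one of `t`, `o`, `f` to `q`.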

The role of the implications $\{p, q\} \le \{p', q'\}$ can be further illustrated by looking
at the following translation:

$$\begin{aligned}
\tau(T, t) &= \tau(p, t) \land \tau(\neg p, t) \land \tau((\neg p \lor q), t) \\
&= p \land \tau(p, f) \land (\tau(\neg p, t) \lor \tau(q, t)) \\
&= p \land \neg p' \land (\tau(p, f) \lor q) \\
&= p \land \neg p' \land (\neg p' \lor q) \\
&= p \land \neg p' .
\end{aligned}$$

While the last formula admits four two-valued models, the formula
$(\{p, q\} \le \{p', q'\}) \land \tau(T, t)$ has no two-valued model, which corresponds to the
fact that $T$ has no three-valued model assigning $t$ to all members of $T$.

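Consequently, since there are no three-valued models assigning $t$ "to" $T$, the formulas
$\neg\tau(T, f)$ and $\tau(T, o)$ must be equivalent; this can be verified as follows:

$$\begin{aligned}
\tau(T, o) &= \neg\tau(T, f) \land \neg\tau(T, t) \\
&= \neg(\tau(p, f) \lor \tau(\neg p, f) \lor \tau(\neg p \lor q, f)) \land \neg(p \land \neg p') \\
&= \neg(\neg p' \lor p \lor (\tau(\neg p, f) \land \tau(q, f))) \land (\neg p \lor p') \\
&= \neg(\neg p' \lor p \lor (p \land \neg q')) \land (\neg p \lor p') \\
&= p' \land \neg p \land (\neg p \lor q') \land (\neg p \lor p') \\
&= p' \land \neg p .
\end{aligned}$$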

Encoding Three-Valued Paraconsistent Logics. To begin with, it is instructive to see that
the previous elaboration already allows for a straightforward encoding of three-valued
entailment, and, in particular, inference in the logic LP [27]:

Definition 2. Let $T$ be a set of formulas and $\varphi$ a formula.

For $P = var(T \cup \{\varphi\})$, we define

$$\mathcal{T}_3(T, \varphi) = \forall P, P'\,\big(((P \le P') \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)\big) .$$

Then, we have the following result.

Theorem 2. $T \models_3 \varphi$ iff $\mathcal{T}_3(T, \varphi)$ is true.




Paraconsistent Reasoning via Quantified Boolean Formulas, II 535 



To be precise, we obtain (original) inference in LP [27] when restricting $T$ and $\varphi$ to
formulas whose connectives are among $\land$, $\neg$, and $\lor$ only.

Let us now turn to Priest's logic $LP_m$ [28]. For this, we must, roughly speaking,
enhance the encoding of LP in order to account for the principle of "minimizing
inconsistency" used in $LP_m$. This is accomplished in the next definition by means of
the QBF named $Min_m(T)$.

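Definition 3. Let $T$ be a set of formulas with $P = var(T)$, $V$ an indexed set of globally
new atoms corresponding to $P$, and $\varphi$ a formula. Moreover, let
$O_P = \{\tau(p, o) \mid p \in P\}$ and $O_V = \{\tau(v, o) \mid v \in V\}$.

We define

$$Min_m(T) = (P \le P') \land \neg\exists V, V'\,\big((O_V < O_P) \land (V \le V') \land \neg\tau(T[P/V], f)\big)$$

and, for $R = P \cup var(\varphi)$,

$$\mathcal{T}_m(T, \varphi) = \forall R, R'\,\big((Min_m(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)\big) .$$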

For illustration, let us return to $T = \{p, \neg p, (\neg p \lor q)\}$. We have $P = \{p, q\}$ and
correspondingly $V = \{u, v\}$. We start our analysis on the subformula

$$(O_V < O_P) \land (V \le V') \land \neg\tau(T[P/V], f), \qquad (3)$$

having

$$O_V < O_P = (\tau(u, o) \to \tau(p, o)) \land (\tau(v, o) \to \tau(q, o)) \land \neg\big((\tau(p, o) \to \tau(u, o)) \land (\tau(q, o) \to \tau(v, o))\big) .$$

From the definition of $\le$ and $<$, one can see that $(O_V < O_P) \land (V \le V')$ is true
under a two-valued interpretation $v$ iff

- for any variable from $V$ assigned $o$ under the associated interpretation $v_3$, the
  corresponding variable from $P$ is also assigned $o$ under $v_3$; and

- there exists at least one variable from $V$ which is not assigned $o$ under $v_3$, although
  the corresponding variable from $P$ is assigned $o$ under $v_3$.

Additionally, $v$ has to be a two-valued model of $\neg\tau(T[P/V], f)$. By Corollary 1, $v_3$
then has to be a three-valued model of $T[P/V]$. From our previous discussion and by
renaming, we know that $T[P/V]$ possesses three three-valued models, viz.
$\{u : o, v : t\}$, $\{u : o, v : o\}$, and $\{u : o, v : f\}$. In the case of model
$\{u : o, v : o\}$, we cannot find an assignment to $p, q$ which has more variables being
assigned $o$. The other two cases extend to two two-valued models, $v'$ and $v''$, of (3)
with their associated three-valued interpretations $v'_3$ and $v''_3$ given by
$\{p : o, q : o, u : o, v : t\}$ and $\{p : o, q : o, u : o, v : f\}$, respectively. Now, one can
check that the only three-valued interpretation $w$ such that $Min_m(T)$ is false under
$w_2$ is $\{p : o, q : o\}$. Recalling the three-valued models of $T$, we have that the
two-valued models of $Min_m(T) \land \neg\tau(T, f)$ yield two three-valued models,
$\{p : o, q : t\}$ and $\{p : o, q : f\}$.

In general, we have the following result.

Theorem 3. $T \models_m \varphi$ iff $\mathcal{T}_m(T, \varphi)$ is true.

To be precise, we obtain (original) inference in $LP_m$ [28] when restricting $T$ and
$\varphi$ to formulas whose connectives are among $\land$, $\neg$, and $\lor$ only.

Analogously, we can now give an axiomatization of Besnard and Schaub's approach [9].

Definition 4. Let $T$ be a set of formulas with $P = var(T)$, $Q$ an indexed set of globally
new atoms corresponding to $P$, and $\varphi$ a formula. Moreover, let
$O_T = \{\tau(\varphi, o) \mid \varphi \in T\}$ and $O_{T[P/Q]} = \{\tau(\varphi, o) \mid \varphi \in T[P/Q]\}$.

We define

$$Min_n(T) = (P \le P') \land \neg\exists Q, Q'\,\big((O_{T[P/Q]} < O_T) \land (Q \le Q')\big)$$

and, for $R = P \cup var(\varphi)$,

$$\mathcal{T}_n(T, \varphi) = \forall R, R'\,\big((Min_n(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)\big) .$$

The salient difference between the previous definition and the one given in
Definition 3 manifests itself in the sets $O_P$, $O_V$ and $O_T$, $O_{T[P/Q]}$, respectively.
Note that the latter take the original set of premises $T$ into account so that the
translation formula $\neg\tau(T[P/V], f)$ can be dropped.

Theorem 4. $T \models_n \varphi$ iff $\mathcal{T}_n(T, \varphi)$ is true.

Alternative Encodings. In view of the discussion given below, we may alternatively
capture both approaches as follows.

To begin with, concerning $LP_m$, we introduce additional new variables
$S = \{s_p \mid p \in var(T)\}$ and $S' = \{s'_p \mid p \in var(T)\}$, and define

$$Min_{m'}(T) = (P \le P') \land (S \le O_P) \land \neg\exists S', Q, Q'\,\big((S' < S) \land (Q \le Q') \land (S' \le O_Q) \land \neg\tau(T[P/Q], f)\big)$$

and

$$\mathcal{T}_{m'}(T, \varphi) = \forall S, P, P'\,\big((Min_{m'}(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)\big) .$$

For Besnard and Schaub's approach [9], on the other hand, we similarly introduce
additional new variables according to the elements of $T$, viz.
$S = \{s_\varphi \mid \varphi \in T\}$ and $S' = \{s'_\varphi \mid \varphi \in T\}$, and define

$$Min_{n'}(T) = (P \le P') \land (S \le O_T) \land \neg\exists S', Q, Q'\,\big((S' < S) \land (Q \le Q') \land (S' \le O_{T[P/Q]})\big)$$

and

$$\mathcal{T}_{n'}(T, \varphi) = \forall S, P, P'\,\big((Min_{n'}(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)\big) .$$

In analogy to Theorems 3 and 4, we then obtain the following result:

Theorem 5. $T \models_\nu \varphi$ iff $\mathcal{T}_{\nu'}(T, \varphi)$ is true, for both $\nu \in \{m, n\}$.







Employing Circumscription. In order to shed some more light on the two paraconsistent
logics discussed above, let us slightly reformulate their minimization axiom in terms
of circumscription [25]: Let $T$ be a propositional theory and $(P, Q, Z)$ a partition of
$var(T)$. Assume two (two-valued) models $v, v'$ of $T$, and define $v \le_{P;Z} v'$ iff the
following conditions are satisfied:

1. $\{q \in Q \mid v(q) = t\} = \{q \in Q \mid v'(q) = t\}$;

2. $\{p \in P \mid v(p) = t\} \subseteq \{p \in P \mid v'(p) = t\}$.

A model $v$ of $T$ is called $(P; Z)$-minimal if no model $v'$ of $T$ with $v' \ne v$ satisfies
$v' \le_{P;Z} v$.

Informally, the partition $(P, Q, Z)$ can be interpreted as follows: The set $P$ contains
the variables to be minimized, $Z$ are those variables that can vary in minimizing $P$, and
the remaining variables $Q$ are fixed in minimizing $P$.

Let $T$ be a theory and $(P, Q, Z)$ a partition of $var(T)$, where $P = \{p_1, \ldots, p_n\}$
and $Z = \{z_1, \ldots, z_m\}$. The set of $(P; Z)$-minimal models of $T$ is given by the truth
assignments to the QBF

$$Circ(T; P; Z) = T \land \neg\exists \tilde{P}, \tilde{Z}\,\big((\tilde{P} < P) \land T[P/\tilde{P}, Z/\tilde{Z}]\big) ,$$

where $\tilde{P} = \{\tilde{p}_1, \ldots, \tilde{p}_n\}$ and $\tilde{Z} = \{\tilde{z}_1, \ldots, \tilde{z}_m\}$ are sets of new
variables corresponding to $P$ and $Z$, respectively.
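For small theories, the models of such a circumscription formula can be computed by plain enumeration. The sketch below follows the shape of the QBF: a model is kept unless some other model agrees with it on $Q$ and makes a proper subset of $P$ true, with $Z$ left unconstrained. Representing the theory as a Python predicate and the name `minimal_models` are assumptions of this example.

```python
from itertools import product


def minimal_models(theory, P, Q, Z):
    """Models of `theory` that are minimal on P, with Q fixed and Z varying.

    `theory` is a predicate over assignments (dict: atom -> bool).
    """
    atoms = list(P) + list(Q) + list(Z)
    candidates = (dict(zip(atoms, bits))
                  for bits in product((False, True), repeat=len(atoms)))
    models = [m for m in candidates if theory(m)]

    def true_in(m, atoms_subset):
        return {a for a in atoms_subset if m[a]}

    def strictly_smaller(m1, m2):
        # Same extension on Q, strictly smaller extension on P; Z is free.
        return (true_in(m1, Q) == true_in(m2, Q)
                and true_in(m1, P) < true_in(m2, P))

    return [m for m in models
            if not any(strictly_smaller(w, m) for w in models)]


def a_or_b(m):
    return m["a"] or m["b"]


both_minimized = minimal_models(a_or_b, P=["a", "b"], Q=[], Z=[])
b_varies = minimal_models(a_or_b, P=["a"], Q=[], Z=["b"])
```

Minimizing both atoms of `a or b` leaves the two models that make exactly one atom true, whereas minimizing `a` with `b` allowed to vary leaves only the model with `a` false and `b` true.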

Then, for $P = var(T)$, we have that $Min_{m'}(T) \land \neg\tau(T, f)$ can be written as

$$C_m(T) = Circ\big((P \le P') \land (S \le \{\tau(p, o) \mid p \in P\}) \land \neg\tau(T, f);\ S;\ P \cup P'\big),$$

where $S = \{s_p \mid p \in var(T)\}$, and, analogously, $Min_{n'}(T)$ can be written as

$$C_n(T) = Circ\big((P \le P') \land (S \le \{\tau(\varphi, o) \mid \varphi \in T\});\ S;\ P \cup P'\big)$$

where $S = \{s_\varphi \mid \varphi \in T\}$.

Summarizing, we have^3

1. $T \models_m \varphi$ iff $(C_m(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)$ is true;

2. $T \models_n \varphi$ iff $(C_n(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)$ is true.

This demonstrates how the principle of circumscription can be exploited for
characterizing the minimization process in the two considered paraconsistent logics.

4 Related Work 

A whole variety of approaches uses lattices for dealing with inconsistency, e.g.,
[1,7,29]. For instance, [1,2] describes a system based on four-valued logic that allows
for constraining "the most consistent" models in the meta-level by a user-given set of
propositions taking classical truth-values only. In fact, in [3] the preference relation
$<_m$ is generalized to four-valued logics, giving rise to two distinct orderings: Given
two four-valued interpretations over truth values $\{t, f, o, o'\}$^4, define

- $v \le_1 w$ iff $\{p \in P \mid v(p) = o\} \subseteq \{p \in P \mid w(p) = o\}$; and

- $v \le_2 w$ iff $\{p \in P \mid v(p) \in \{o, o'\}\} \subseteq \{p \in P \mid w(p) \in \{o, o'\}\}$.

^3 In fact, $(C_m(T) \land \neg\tau(T, f)) \to \neg\tau(\varphi, f)$ is true iff $C_m(T) \to \neg\tau(\varphi, f)$ is true.

^4 In a four-valued setting, $o, o'$ are usually denoted by $\bot, \top$.

As with $\models_m$, the models minimal with respect to these orderings are then used to
define two distinct four-valued consequence relations. Although we do not detail it
here, we mention that an appropriate encoding of the underlying four-valued logic
(similar to the one given in Definition 1), along with a slightly generalized QBF
encoding (similar to the one given in Definition 3), allows for a straightforward
encoding of the two four-valued consequence relations by means of QBFs. Interestingly,
both four-valued paraconsistent logics have recently been implemented in [4] by appeal
to special-purpose circumscription solvers [16]. Furthermore, [14] proposes a
translation-based approach to reasoning in the presence of contradictions that
translates a logic into a family of other logics, e.g., classical logic into three-valued
logics.

Among the existing inference methods for three-valued paraconsistent logics, we
mention the following ones. A resolution-based system close to LP yet with a stronger
disjunction is described in [23]. In fact, there is an indirect way of implementing $LP_m$
because its consequence relation has recently been shown in [15] to be equivalent to a
particular relation within the family of signed systems [10], whose inference can also
be mapped onto QBFs, as shown in [11]. The resulting encoding is, however, of little
interest since it lacks the spirit of "minimizing inconsistency" and thus fails to provide
insight into $LP_m$; also, it is not extendible with the genuine implication $\to$ or even to
alternative approaches such as [9]. We recall from the introductory section that the
latter approach was originally axiomatized in [9] by means of a Hilbert system
comprising 26 axiom schemata.

5 Conclusion 

Considering two paraconsistent logics based on a minimization principle applied to a 
three-valued logic, we have shown how a translation into the language of quantified 
Boolean formulas is possible. The translations obtained clearly fall under the same um- 
brella, giving rise to a uniform setting for the axiomatization of such logics. (In particular, 
we have provided translations explicitly displaying the connection with circumscription, 
the classical proof theory for logical minimization.) Moreover, once such an axiomati- 
zation is available, existing QBF solvers can be used for implementation without further 
ado. Having efficient QBF solvers, like the systems described in [12,20,6], makes such 
a rapid prototyping approach practicably applicable. Finally, we remark that what we 
did allows us, in some sense, to express this kind of paraconsistent reasoning in (higher 
order) classical propositional logic and so to harness classical reasoning mechanisms 
from (a conservative extension of) propositional logic.

Abstract. According to the standard definition, a logic is said to be
paraconsistent if it fails the (so-called) rule of ex falso: i.e., $\alpha, \neg\alpha \nvdash \beta$.
Thus, paraconsistency captures an important sense in which a logic is
inconsistency-tolerant, namely when arbitrary inference is prohibited in
the presence of inconsistencies. We investigate a family of notions of
paraconsistency within the context of modal logics.

An illustration comes from an epistemic version of the lottery paradox
showing how important it sometimes is to distinguish between ordinary
and higher order beliefs: A logic may fail to tolerate inconsistent beliefs
at a given level while tolerating inconsistent beliefs across different levels.
We identify a few properties arising from some well-known modal the- 
orems and inference rules in order to classify modal logics according to 
their capacity to tolerate modalized inconsistencies. In doing so, we show 
various relationships among these logics. 



1 Introduction 

In the extensional case, paraconsistency is a feature of logics that do not identify
inconsistent theories with trivial theories (those that consist of all formulae of
the logical language under consideration). Notably, the inadequacy of classical
logic^1 in this respect is illustrated by the so-called ex falso which is the following
tautologous schema:

$$\models \alpha \land \neg\alpha \to \beta$$

Indeed, the ex falso can be understood as specifying conditions of deductive
breakdown - when inference can go in any direction. As intensionality enters
the picture through modal operators, more notions of deductive breakdown and
triggering conditions arise. In an extension of classical logic such as the standard
modal logic K (we are to follow the naming convention adopted in [Chellas 1980])
for example, the following holds:

$$\vdash_K \Box\alpha \land \Box\neg\alpha \to \Box\beta$$

^1 Throughout the text, the symbol $\models$ is strictly reserved for classical logic whereas
$\vdash$ (whether subscripted or not) is used for other logics.

Thus, the conclusion set $\{\Box\beta \mid \beta \in \mathcal{L}\}$ (where $\mathcal{L}$ denotes the set of all
formulas of the language, whether they are modal or not) can be seen as a modal kind of
deductive breakdown of the logic. Note that a related version of such a modal
ex falso is not provable in K:

$$\nvdash_K \Box\Box\alpha \land \Box\Box\neg\alpha \to \Box\beta$$

More generally, the modal logic K enforces that inconsistencies split over
identical modalities

$$\vdash_K \Box\Box\alpha \land \Box\Box\neg\alpha \to \Box\Box\beta$$

$$\vdash_K \Box\Box\Box\alpha \land \Box\Box\Box\neg\alpha \to \Box\Box\Box\beta$$

are contrasted with inconsistencies defined across distinct modalities

$$\nvdash_K \Box\alpha \land \Box\Box\neg\alpha \to \Box\beta$$

$$\nvdash_K \Box\Box\alpha \land \Box\neg\alpha \to \Box\Box\beta$$

This indicates that modalized inconsistencies of a certain modal depth may
give arbitrary conclusions of the same modal depth but need not give arbitrary
conclusions of any modal depth.
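Non-provability claims of this kind can be witnessed by small Kripke countermodels, since every theorem of K holds at every world of every Kripke model. The sketch below is an illustration only: the tuple encoding of formulas, the model format (an accessibility map plus a valuation) and the names `holds` and `box` are assumptions made here, and checking a single model can refute a schema instance but never establishes theoremhood.

```python
def holds(model, w, phi):
    """Truth of phi at world w; model = (R, V) with R: world -> successors,
    V: world -> set of atoms true at that world."""
    R, V = model
    op = phi[0]
    if op == "var":
        return phi[1] in V[w]
    if op == "not":
        return not holds(model, w, phi[1])
    if op == "and":
        return holds(model, w, phi[1]) and holds(model, w, phi[2])
    if op == "imp":
        return (not holds(model, w, phi[1])) or holds(model, w, phi[2])
    if op == "box":
        return all(holds(model, u, phi[1]) for u in R[w])
    raise ValueError(f"unknown connective: {op}")


def box(n, phi):
    """Prefix phi with n box operators."""
    for _ in range(n):
        phi = ("box", phi)
    return phi


A, B = ("var", "a"), ("var", "b")


def ex_falso(i, j, k):
    """The formula box^i a and box^j (not a) -> box^k b."""
    return ("imp", ("and", box(i, A), box(j, ("not", A))), box(k, B))


# World 0 sees world 1, which is a dead end where b is false.
dead_end = ({0: [1], 1: []}, {0: set(), 1: set()})

# World 0 sees 1, world 1 sees 2; a holds at 1 only, b holds nowhere.
chain = ({0: [1], 1: [2], 2: []}, {0: set(), 1: {"a"}, 2: set()})
```

At world 0 of `dead_end`, the depth-two antecedents hold vacuously while the depth-one conclusion fails, so `ex_falso(2, 2, 1)` is refuted there although `ex_falso(2, 2, 2)` holds; `chain` refutes `ex_falso(1, 2, 1)` at world 0 in the same way.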

Presumably the most striking example is an epistemic counterpart of the
lottery paradox ([Kyburg 1997]). Consider for instance an agent who has the
capacity to form (ordinary) beliefs (e.g., George is bald) as well as higher order
beliefs about her own beliefs. If holding a large collection of beliefs, she may
have grounds to believe that at least one of her ordinary beliefs is mistaken. But
of each of her ordinary beliefs, she has no grounds to believe that this particular
one is mistaken and so she does believe in each of her ordinary beliefs after all.
Interpreting $\Box$ as the belief operator, we may write:

$$\Box\neg\Box(\alpha_1 \land \ldots \land \alpha_n) \qquad (1)$$

$$\Box\alpha_1 \land \ldots \land \Box\alpha_n \qquad (2)$$

where each $\alpha_i$ stands for an ordinary belief and so contains no occurrence of $\Box$.
As (2) above is equivalent to

$$\Box(\alpha_1 \land \ldots \land \alpha_n) \qquad (3)$$

in standard epistemic logic, the nature of the conflict between (1) and (3) should
clearly be distinguished from believing contradictory claims:

$$\Box\alpha \land \Box\neg\alpha \qquad (4)$$

Interpreting $\Box$ as the knowledge operator, (1) and (3) are actually inconsistent in
view of the famous principle $\Box\alpha \to \alpha$. And a stronger modal logic can generate
even more inconsistencies - modalities interacting awkwardly with each other
(see [Meyer and van der Hoek 1998] [Priest 2002] [Wansing 2002] for discussions
of epistemic inconsistencies).

These considerations provide the impetus to develop a principled method to 
make finer distinctions than the original ex falso. In particular, we need at least to 
distinguish between extensional inconsistency and intensional inconsistency. As a 
logic may have more operators, for instance two negations, one being extensional 
but not necessarily the other, the interaction of operators also raises the issue 
of corresponding forms of the ex falso (see [Carnielli & Marcos 1999, 2002]). 

Even though some paraconsistent logics have been defined from modal logics 
[Perzanowski 1975] [Blaszczuk 1984] and [Beziau 2002], the idea here is a more 
general investigation of paraconsistency in a pure modal setting. 



2 Modal Paraconsistency Based on Conditional Formulae 



We consider modal logics over a propositional language involving the unary
operator $\Box$. They may have none, some or all of the modal inference rules below:

[RE] $\dfrac{\alpha \to \beta \qquad \beta \to \alpha}{\Box\alpha \to \Box\beta}$

[RM] $\dfrac{\alpha \to \beta}{\Box\alpha \to \Box\beta}$

[RT] $\dfrac{\Box\alpha}{\alpha}$



[R4] $\dfrac{\Box\alpha}{\Box\Box\alpha}$

Of special interest are the following non-modal rules:

modus ponens $\dfrac{\alpha \qquad \alpha \to \beta}{\beta}$

the transitivity rule $\dfrac{\alpha \to \beta \qquad \beta \to \gamma}{\alpha \to \gamma}$

The starting analysis involves the presence of an inconsistency symbol $\bot$ in the
language so that it is possible to rewrite the ex falso as:

$$\models \bot \to \beta$$



Of course, $\beta$ stands for all formulae of the language and these include the modal
ones when it comes to a logic which is a modal extension of classical logic.
Incidentally, this shows that modal paraconsistency requires the ex falso to be
modified so as to have modalities occurring in the resulting schema:

Modal paraconsistency is essentially modal.

Accordingly, the simplest modal counterpart of the ex falso is:

$$\Box\bot \to \Box\beta$$

Thus, $\Box\bot$ is the simplest modal inconsistency. Generalization is straightforward,
with $\Box^n$-inconsistency being $\Box^n\bot$ and the ex falso taking the form:

$$\Box^n\bot \to \Box^n\beta$$

Although not all the logics we will consider are extensions of classical logic,
our analysis is based on classical antinomies: Letting $\mathcal{T}$ denote the class of all
non-modal tautologous schemata, define $\mathcal{A} = \{\alpha \mid \alpha \to \bot \in \mathcal{T}\}$.

We are now in the position to specify a generic notion of modal paraconsistency
via conditional formulae:

Definition 1. Let $\varphi \in \mathcal{A}$. For $n \ge 1$, a logic is $\Box^n\varphi$-paraconsistent iff it fails
to have $\Box^n\varphi \to \Box^n\beta$ as a theorem.

Note that this definition is parameterized to both $\varphi$ and $n$. For different
choices of $\varphi$ and $n$, a logic need not behave the same with respect to those
different cases for $\Box^n\varphi$-paraconsistency. Definition 1 also relies on theoremhood
being non-empty in the modal logic under consideration and another approach
is needed when it comes to logics without any theorems. This will be discussed
in the next section about a rule-based formulation of modal paraconsistency.

Example 1. Take $\varphi$ to be $\neg(\alpha \to (\beta \to \alpha))$. Consider the modal logic S0.9 that
can be axiomatized by the tautologous schemata and $\Box(\alpha \to \beta) \to (\Box\alpha \to \Box\beta)$
and $\Box\alpha \to \alpha$ (as well as $\Box\alpha$ if $\alpha$ is in any of these three categories) together
with the rule: from $\Box(\alpha \to \beta)$ and $\Box(\beta \to \alpha)$, infer $\Box(\Box\alpha \to \Box\beta)$. For every
$n \ge 1$, it fails to be $\Box^n\varphi$-paraconsistent. Many well-known modal logics (KT, S4, ...)
accordingly fail to be $\Box^n\neg(\alpha \to (\beta \to \alpha))$-paraconsistent.

Example 2. Take $\varphi$ to be $\alpha \land \neg\alpha$ and $\psi$ to be $\neg(\alpha \lor \neg\alpha)$. Consider a (non-modal)
strongly paracomplete logic (in symbols, $\nvdash \neg(\alpha \lor \neg\alpha) \to \beta$) which happens to fail
paraconsistency (in symbols, $\vdash \alpha \land \neg\alpha \to \beta$) and define a minimal extension of it
closed under the rule [RM]. Such a modal logic tolerates the necessary rejection
of the excluded middle (i.e., it is $\Box^n\psi$-paraconsistent: $\nvdash \Box^n\neg(\alpha \lor \neg\alpha) \to \Box^n\beta$)
without also tolerating the necessity of a "blatant contradiction" (i.e., it fails to
be $\Box^n\varphi$-paraconsistent: $\vdash \Box^n(\alpha \land \neg\alpha) \to \Box^n\beta$).

The first question is: In what sense are some of these notions of paraconsis- 
tency weaker than one another? 

Property 1. Let $n \ge 1$. If a logic admitting [RM] is $\Box^n\varphi$-paraconsistent then
it also is $\Box^m\varphi$-paraconsistent for all $m < n$.
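One way to see why [RM] makes paraconsistency spread downwards is to argue by contraposition. The derivation below is a reconstruction offered as a sketch, not a proof quoted from the source; it assumes that [RM] may be applied to any theorem of the logic and that $1 \le m < n$.

```latex
\begin{align*}
&\vdash \Box^{m}\varphi \to \Box^{m}\beta
  && \text{(the logic fails to be $\Box^{m}\varphi$-paraconsistent)}\\
&\vdash \Box^{m+1}\varphi \to \Box^{m+1}\beta
  && \text{(by [RM])}\\
&\quad\vdots
  && \text{($n-m$ applications of [RM] in total)}\\
&\vdash \Box^{n}\varphi \to \Box^{n}\beta
  && \text{(the logic fails to be $\Box^{n}\varphi$-paraconsistent)}
\end{align*}
```

Read contrapositively, a logic that is paraconsistent at depth $n$ cannot fail to be paraconsistent at any smaller depth $m$.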




544 P. Besnard and P. Wong 



Property 2. Let $n \ge 1$. If a logic admitting the transitivity rule and the
well-known axiom schema

T : $\Box\alpha \to \alpha$

is $\Box^n\varphi$-paraconsistent then it also is $\Box^m\varphi$-paraconsistent for all $m < n$.

Of course, T is needlessly strong. It could be replaced by the less familiar^2
schema 4c : $\Box\Box\alpha \to \Box\alpha$.

While each of these two properties gives a sufficient condition for one level
(modal depth) of paraconsistency to spread to some others (it spreads downwards
from $n$), there is a limit where one level spreads to all others:

Property 3. Let $\varphi \in \mathcal{A}$. Consider a logic admitting the transitivity rule and
the well-known axiom schemata

T : $\Box\alpha \to \alpha$
4 : $\Box\alpha \to \Box\Box\alpha$

It is $\Box^m\varphi$-paraconsistent for some $m \ge 1$ iff it is $\Box^n\varphi$-paraconsistent for all $n \ge 1$.

Again, T is needlessly strong. T and 4 could be replaced by just 4! which is
4 augmented with its own converse, 4c (see above).

In a sense, Property 3 is actually a degenerate case because T and 4 together
(or equivalently 4!) shrink all $\Box^n$ ($n \ge 1$) prefixes to $\Box$. In less degenerate cases,
$\Box^n$ may shrink to $\Box^i$ for some arbitrary but fixed limit $i < n$. For instance,
4! may be weakened so as to give a family of modal logics with the following
versions of 4 and 4c:

$4^i : \Box^i\alpha \to \Box^{i+1}\alpha \qquad 4c^i : \Box^{i+1}\alpha \to \Box^i\alpha$

In any of these logics, there is an exact counterpart to Property 3. The only 
change required is that m > i and n > i for the arbitrary but fixed limit i. 

Turning to the question of the relationship between $\Box^n\varphi$-paraconsistency
and $\Box^n\psi$-paraconsistency, consider now a logic which satisfies the principle of
uniform substitution. For all $n \ge 1$, if such a logic is $\Box^n\varphi$-paraconsistent then
it is $\Box^n\psi$-paraconsistent provided that $\varphi \in \mathcal{A}$ can be obtained from $\psi \in \mathcal{A}$ by
uniform substitution. There are of course other possibilities, for instance:

Property 4. Consider a logic which admits both [RM] and the transitivity rule.
For $n \ge 1$, if it is a $\Box^n\varphi$-paraconsistent logic then it also is $\Box^n\psi$-paraconsistent
for all $\psi \in \mathcal{A}$ such that $\varphi \to \psi$ is a theorem of the logic.

Weaker conditions are possible when considering the equivalence between
$\Box^n\varphi$-paraconsistency and $\Box^n\psi$-paraconsistency. Stating that $\varphi$ and $\psi$ are
equivalent iff both $\varphi \to \psi$ and $\psi \to \varphi$ are theorems of the logic, it is enough to
consider a modal rule of which [RM] is a special case:

^2 However, a temporal interpretation of modality identifies $\Box\Box\alpha \to \Box\alpha$ with density.

Property 5. Let $n \ge 1$. Consider a logic having [RE] and the transitivity rule.
Whenever $\varphi$ and $\psi$ are equivalent in that logic, it is $\Box^n\varphi$-paraconsistent iff it
is $\Box^n\psi$-paraconsistent.

What does it take for a logic to have all the variations of $\Box^n\psi$-paraconsistency
(for $n$ fixed) to collapse into one and only $\Box^n\varphi$-paraconsistency as all $\varphi \in \mathcal{A}$ are
equivalent under deduction in classical logic? From Property 5, an answer is:

Property 6. For every $n \ge 1$, a logic which is an extension of classical logic
admitting [RE] is $\Box^n\varphi$-paraconsistent iff it is $\Box^n\psi$-paraconsistent.

Presenting sufficient conditions for a logic to fail to be $\Box^n\varphi$-paraconsistent in
all parameters ($\varphi$ and $n$) is worthwhile although this should not (unless $n = 1$)
be taken to mean that all modal formulae are inferred from $\Box^n\varphi$ where $\varphi \in \mathcal{A}$,
regardless of whether the logic admits modus ponens. Here is a class of logics
that enjoy no $\Box^n\varphi$-paraconsistency at all:

Property 7. A logic which is an extension of classical logic and admits [RM]
fails to be $\Box^n\varphi$-paraconsistent for all $n \ge 1$ and for all $\varphi \in \mathcal{A}$.

Note that Property 7 does not say that all logics admitting [RM] would fail
to be $\Box^n\varphi$-paraconsistent. That systematically fails only for those which are
extensions of classical logic. Other logics might still enjoy $\Box^n\varphi$-paraconsistency
(even if they admit principles such as $\vdash \bot \to \beta$ and [RM]) but there are certain
necessary conditions.

Interestingly enough, $\Box^n\varphi$-paraconsistency is not a notion arising only from
(classical) antinomies governed by a $\Box^n$ modality. A modal ex falso can follow
from the equivalence of all contradictions in the form:

$$(\alpha \land \neg\alpha) \to (\beta \land \neg\beta)$$

Property 8. Consider a logic admitting [RE] and the well-known schemata:

M : $\Box(\alpha \land \beta) \to \Box\alpha \land \Box\beta$

C : $\Box\alpha \land \Box\beta \to \Box(\alpha \land \beta)$

as well as the non-modal principles:

$\dfrac{\alpha \to \beta \qquad \beta \to \gamma}{\alpha \to \gamma}$ $\qquad$ $\alpha \land \beta \to \alpha$ $\qquad$ $(\alpha \land \neg\alpha) \to (\beta \land \neg\beta)$

For $n \ge 1$ and all non-modal $\alpha$, the logic fails to be $\Box^n(\alpha \land \neg\alpha)$-paraconsistent.

In general, a modal logic can have the ex falso as a theorem and still be
$\Box^n(\alpha \land \neg\alpha)$-paraconsistent for any $n \ge 1$ but it is obvious that a logic which has
the ex falso as a theorem and [RM] as an admissible rule fails, for all $n \ge 1$ and
for all non-modal $\alpha$, to be $\Box^n(\alpha \land \neg\alpha)$-paraconsistent. A variation is:

Property 9. Consider a logic $L$ whose non-modal inference within modality $\Box^n$
for some fixed $n \ge 1$ is given by an explosive non-modal logic $L'$ as follows:

$$\vdash_{L'} \alpha \to \beta \;\Rightarrow\; \vdash_L \Box^n\alpha \to \Box^n\beta \quad\text{for all non-modal } \alpha \text{ and } \beta$$

where explosive means that the ex falso is a theorem of the logic ($L'$ here). Then,
$L$ fails to be $\Box^n(\alpha \land \neg\alpha)$-paraconsistent.

In classical logic, trivialization is an immediate consequence of the ex falso: 
From a contradiction, every formula is deduced. Even if considering only modal 
logics that admit modus ponens, the situation is different here because modal 
depth makes room for levels upon which modal paraconsistency need not im- 
pose a uniform effect to take place: Specific trivialization answers the failure of 
□"(/5-paraconsistency, depending on n. By -trivialization, we mean that all 
formulae of the form □”/? are inferred. If n yf m, □"-trivialization is in general 
independent from □'"-trivialization unless the logic is too strong to exhibit any 
kind of paraconsistency at all: 

Property 10. Consider a logic such as in Property 3, additionally admitting 
modus ponens. In the event that it fails to be T-paraconsistent in all n > 1, 

any -inconsistency for some m > 1 entails -trivialization for all n > 1. 

Property 11. A logic such as in Property 9 for n = 1, that additionally admits 
modus ponens and enjoys the well-know schema: 

C : □oA □/d -)> □(« A/3) 

is such that □(« A ~<a) yields U -trivialization for all non-modal a. 

There exists a violation of the independence between $\Box^n$-trivialization and
$\Box^m$-trivialization that is of utmost importance: For any logic which admits
modus ponens, even being $\Box^n\varphi$-paraconsistent for every $\varphi \in A$ does not prevent
$\Box^n$-trivialization from occurring when there is a $\Box^m$-inconsistency for some $m < n$ such
that the logic fails to be $\Box^m\bot$-paraconsistent. More generally:

Property 12. Consider a logic admitting modus ponens. For all $n \geq 1$ and all
$\varphi$ such that the logic fails to be $\Box^n\varphi$-paraconsistent, $\Box^n\varphi$ yields $\Box^m$-trivialization
for all $m > n$.

In a nutshell, this is the reason why logics such that $\Box^n\varphi$-paraconsistency
implies $\Box^m\varphi$-paraconsistency for all $m < n$ should be preferred.

Another way to look at the same phenomenon consists of switching to a rule- 
based formulation, as is done in the next section. Still another option would be 
to adopt an amended definition of $\Box^n\varphi$-paraconsistency (with the immediate but
important consequence that being $\Box^n\varphi$-paraconsistent unconditionally implies
being $\Box^m\varphi$-paraconsistent for all $m < n$) as follows:

Definition 2. Let $\varphi \in A$. For $n \geq 1$, a logic is $\Box^n\varphi$-paraconsistent iff there
exists no $u \geq 1$ such that $\Box^u\varphi \to \Box^n\beta$ is a theorem of the logic.
In order to obtain a generalization (here given from Definition 1 but a formulation
for Definition 2 is unproblematic) of $\Box^n\varphi$-paraconsistency, consider the
class of all non-modal tautologous schemata $\alpha_1 \to (\alpha_2 \to (\cdots \to (\alpha_p \to \bot)\cdots))$
for all $p \geq 1$.

Definition 3. Let $\Gamma \in T$ be $\alpha_1 \to (\alpha_2 \to (\cdots \to (\alpha_p \to \bot)\cdots))$. For all $n \geq 1$,
a logic is $\Box^n\Gamma$-paraconsistent iff $\Box^n\alpha_1 \to (\Box^n\alpha_2 \to (\cdots \to (\Box^n\alpha_p \to \Box^n\beta)\cdots))$
fails to be a theorem.

All previous properties have counterparts here, provided that a few things 
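are taken care of. For example, either [RM] is generalized to

[RM$_p$] $\dfrac{\alpha_1 \to (\alpha_2 \to (\cdots \to (\alpha_p \to \beta)\cdots))}{\Box\alpha_1 \to (\Box\alpha_2 \to (\cdots \to (\Box\alpha_p \to \Box\beta)\cdots))}$

or the following modal and non-modal principles apply:

$\Box(\alpha \to \beta) \to (\Box\alpha \to \Box\beta)$ $\qquad \dfrac{\alpha}{\Box\alpha}$ $\qquad \dfrac{\alpha \quad \alpha \to \beta}{\beta}$

$(\alpha \to \beta) \to ((\beta \to \gamma) \to (\alpha \to \gamma))$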



3 Modal Paraconsistency in Rule-Based Formulation 



Further generalization arises from weakening the principles at stake by rewriting 
schemata as rules. Definition 3 becomes: 



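Definition 4. Let $p \geq 1$ and $S = \{\alpha_1, \ldots, \alpha_p\}$ (all $\alpha_i$'s are non-modal schemata)
be such that

$\dfrac{\alpha_1 \;\cdots\; \alpha_p}{\bot}$ is admissible in classical logic.

For $n \geq 1$, a logic is $[\Box^n S]$-paraconsistent iff it fails to have

$[\Box^n S] \quad \dfrac{\Box^n\alpha_1 \;\cdots\; \Box^n\alpha_p}{\Box^n\beta}$

as an admissible rule.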

Abbreviations. Let $I$ denote the set of all (non-modal) admissible rules in classical
logic that have the form:

$\dfrac{\alpha_1 \;\cdots\; \alpha_p}{\bot}$

Also, $C$ will be taken to denote the set of all finite sets of non-modal schemata
$S = \{\alpha_1, \ldots, \alpha_p\}$ such that

$\dfrac{\alpha_1 \;\cdots\; \alpha_p}{\bot} \in I$
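As a hedged illustration of membership in $C$, the sketch below reads admissibility of the rule "from $\alpha_1, \ldots, \alpha_p$ infer $\bot$" as joint classical unsatisfiability of the premises, treating schematic letters as atoms. That reading, the function names and the encoding of schemata as Python predicates are assumptions of this sketch, not notation from the text.

```python
from itertools import product

def jointly_unsatisfiable(schemata, atoms):
    """True iff no classical valuation of `atoms` satisfies every schema."""
    for bits in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, bits))
        if all(s(v) for s in schemata):
            return False
    return True

def implies(x, y):
    """Material implication."""
    return (not x) or y

# S = {a, not a}
S1 = [lambda v: v["a"], lambda v: not v["a"]]
# S = {not (a -> (b -> a))}
S2 = [lambda v: not implies(v["a"], implies(v["b"], v["a"]))]
# {a, b} is satisfiable, so under this reading it is not in C
S3 = [lambda v: v["a"], lambda v: v["b"]]

print(jointly_unsatisfiable(S1, ["a"]))
print(jointly_unsatisfiable(S2, ["a", "b"]))
print(jointly_unsatisfiable(S3, ["a", "b"]))
```

The truth-table enumeration is exponential in the number of letters, which is acceptable for the small schemata considered here.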

There is an obvious connection between modal paraconsistency in rule-based 
formulation and modal paraconsistency based on conditional formulae: 
Property 13. Let $\alpha_1, \ldots, \alpha_p$ be finitely many formulae obeying $\alpha_1 \wedge \ldots \wedge \alpha_p \in A$.
If a logic which admits modus ponens and

the rule of adjunction 

a A p 

is $[\Box^n\{\alpha_1, \ldots, \alpha_p\}]$-paraconsistent then it is $\Box^n(\alpha_1 \wedge \ldots \wedge \alpha_p)$-paraconsistent.

In view of Property 13, the comment at the end of Example 1 extends to 
modal paraconsistency in rule-based formulation: Many well-known modal logics
fail to be $[\Box^n S]$-paraconsistent for $S = \{\neg(\alpha \to (\beta \to \alpha))\}$, or $S = \{\alpha, \neg\alpha\}$, ...

Property 14. Let $n \geq 1$. If a logic which admits [RT] is $[\Box^n S]$-paraconsistent
then it also is $[\Box^m S]$-paraconsistent for all $m < n$.

The sufficient condition (in Property 14) for $[\Box^n S]$-paraconsistency to be
stronger than $[\Box^m S]$-paraconsistency has a special case which is a sufficient
condition for $[\Box^n S]$-paraconsistency to be unique whatever $n$:

Property 15. A logic which admits both [RT] and [R4] is $[\Box^m S]$-paraconsistent
in some $m \geq 1$ iff it is $[\Box^n S]$-paraconsistent in all $n \geq 1$.

Indeed, Property 14 is the counterpart of Property 2 and Property 15 is the 
counterpart of Property 3. 



Notation. From now on, we write $S_\Phi$ and $S_\Psi$ to denote two sets in $C$ which consist
of exactly the same schemata except for some $\{\varphi_1, \ldots, \varphi_k\}$ and $\{\psi_1, \ldots, \psi_k\}$.
I.e., there exists some (possibly empty) $S'$ such that $S_\Phi = S' \cup \{\varphi_1, \ldots, \varphi_k\}$ and
$S_\Psi = S' \cup \{\psi_1, \ldots, \psi_k\}$.



Property 16. Let $n \geq 1$. Consider a logic admitting the rule:

$\dfrac{\Box\alpha}{\Box\beta}$ provided that $\dfrac{\alpha}{\beta}$ is admissible

For all $i = 1..k$, let the logic further admit the family of rules: from $\varphi_i$, infer $\psi_i$.
If it is $[\Box^n S_\Phi]$-paraconsistent then it also is $[\Box^n S_\Psi]$-paraconsistent.



In contradistinction with Property 4, Property 16 shows that the conditions
for $[\Box^n S_\Phi]$-paraconsistency to be stronger than $[\Box^n S_\Psi]$-paraconsistency do not
resort to a common rule with the conditions for $[\Box^n S]$-paraconsistency to be
stronger than $[\Box^m S]$-paraconsistency (cf. [RM] in Property 1 and Property 4).



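Property 17. Let $n \geq 1$. Consider a logic admitting the rule:

$\dfrac{\Box\alpha}{\Box\beta}$ provided that $\dfrac{\alpha}{\beta}$ and $\dfrac{\beta}{\alpha}$ are both admissible

Whenever the logic also admits the rules:

$\dfrac{\varphi_i}{\psi_i}$ and $\dfrac{\psi_i}{\varphi_i}$ $\quad (1 \leq i \leq k)$

it is $[\Box^n S_\Phi]$-paraconsistent iff it is $[\Box^n S_\Psi]$-paraconsistent.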
The previous property can easily be expressed for classical equivalence:

Property 18. Consider a logic that is an extension of classical logic and admits
the rule:

$\dfrac{\Box\alpha}{\Box\beta}$ provided that $\dfrac{\alpha}{\beta}$ and $\dfrac{\beta}{\alpha}$ are both admissible

For every $n \geq 1$, for every $\Phi$ and $\Psi$ such that $\varphi_i$ and $\psi_i$ are classically equivalent
when $i = 1..k$, the logic is $[\Box^n S_\Phi]$-paraconsistent iff it is $[\Box^n S_\Psi]$-paraconsistent.

Lastly, here is a condition which makes $[\Box^n S]$-paraconsistency unique:

Property 19. A logic that is an extension of classical logic and admits the rule:

$\dfrac{\Box\alpha}{\Box\beta}$ provided that $\dfrac{\alpha}{\beta}$ is admissible

fails to be $[\Box^n S]$-paraconsistent for all $n \geq 1$ and for all $S \in C$.

There are no counterparts to Property 10 (or Property 12) because the rule-based
formulation adopted now makes $[\Box^n S]$-paraconsistency capture most
of the notion of $\Box^n$-trivialization, which can then be dispensed with.



4 Arbitrary Modal Paraconsistency 

A further step can be taken towards generalization, by considering distinct modalities
to be mixed in the formulation of modal paraconsistency. We must introduce
a convenient way of describing the modalities we are to deal with:

$W = \neg^{*}(\Box\neg^{*})^{+}$
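The description of $W$ reads as a regular expression: any number of negations, followed by one or more boxes, each followed by any number of negations. A minimal sketch of a recognizer follows. The ASCII stand-ins `~` for negation and `[]` for box, and the name `is_modality`, are assumptions made for illustration only.

```python
import re

# ASCII stand-ins (assumed for this sketch): "~" is negation, "[]" is box.
# Pattern: ~* ( [] ~* )+
MODALITY = re.compile(r"(?:~)*(?:\[\](?:~)*)+")

def is_modality(s: str) -> bool:
    """True iff s is a string of W, i.e. it contains at least one box."""
    return MODALITY.fullmatch(s) is not None

for candidate in ["[]", "~[]~", "[][]", "~~[]~[]", "", "~"]:
    print(repr(candidate), is_modality(candidate))
```

The empty string is rejected here, which is why the definition that follows has to add it explicitly for the premises.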



Definition 4 becomes: 



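Definition 5. For $p \geq 1$, let $S = \{\alpha_1, \ldots, \alpha_p\}$ (all $\alpha_i$'s are non-modal schemata)
be such that

$\dfrac{\alpha_1 \;\cdots\; \alpha_p}{\bot}$ is admissible in classical logic.

For $\omega_i \in W \cup \{\epsilon\}$ ($\epsilon$ denotes the empty string) where $1 \leq i \leq p$, for $\omega_{p+1} \in W$,
a logic is $[\{\omega_j\}_j S]$-paraconsistent ($1 \leq j \leq p + 1$) iff it fails to have

$[\{\omega_j\}_j S] \quad \dfrac{\omega_1\alpha_1 \;\cdots\; \omega_p\alpha_p}{\omega_{p+1}\beta}$

as an admissible rule.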



Example 3. The example in the introduction about the modal logic K can now
be captured through expressing that it is a [{□□, 0}{$\alpha$, $\neg\alpha$}]-paraconsistent
logic while it is not [{□, □, □}{$\alpha$, $\neg\alpha$}]-paraconsistent.




550 P. Besnard and P. Wong 



To illustrate that there is no special problem in this generalization, we may 
consider how Property 14 can be reformulated accordingly. Let S be just as in 
Definition 5, taking $\{\omega_i\}_i$ and $\{\omega'_i\}_i$ to be two indexed families of modalities.
Then, for each $i$, $1 \leq i \leq p$, the corresponding [RT] rule would be:

$\dfrac{\omega_i\alpha_i}{\omega'_i\alpha_i}$

Define

$W = \{\omega_i\}_i \cup \{\omega_{p+1}\}$ $\qquad W' = \{\omega'_i\}_i \cup \{\omega_{p+1}\}$

Now the corresponding property says that if a logic is $[W S]$-paraconsistent then
it is also $[W' S]$-paraconsistent.

Example 4. As to the epistemic version of the lottery paradox, a logic tolerating
it would be paraconsistent with respect to the following expression: 

a O-ia 

^ D/3 

Example 5. All sensible modal logics are M}{a, -io;}]-paraconsistent 

where M is any modality (it can even be empty). This simply reflects 

o-ia o a 

which is certainly an indispensable feature of modal logics. The moral here is that 
the expressiveness of Definition 5 requires paying some attention to which cases
induce a meaningful instance of modal paraconsistency (even though technically
no problem arises at all).

Summing up, a special form of paraconsistency can arise in modal logics,
even in some of those that would not be paraconsistent were it not for the
presence of modalities. Although investigating the ex falso in the form of a
schema is enough for logics that admit modus ponens and the deduction theorem,
Sect. 4 indicates how to take care of other logics. Among them are logics with
connectives different from the classical ones. It is unproblematic as long as they
conform to a truth-functional many-valued system: It suffices to restrict the 
range of these truth-functional connectives to the classical truth- values to obtain 
connectives of classical logic (e.g., Sheffer’s stroke, . . .) and form any classical 
antinomies using these connectives. 

5 Concluding Remarks 

Although no brevity or minimality requirement is imposed on A (or similarly R 
or C or I), it should be clear that any item in these sets which is more complex 
than necessary fails to induce any interesting kind of modal paraconsistency: 
Regarding the schemata in $A$ for instance, $\alpha \wedge \neg\alpha \wedge \beta \wedge \neg\beta$ is not worth considering
due to $\alpha \wedge \neg\alpha \in A$ or $\beta \wedge \neg\beta \in A$.

At no point did we consider that modal paraconsistency might be based on 
schemata of the form $\Box^n(\varphi \to \beta)$ as opposed to $\Box^n\varphi \to \Box^n\beta$. Why? If modal
contexts are to be of any use with respect to paraconsistency, modalities such as
$\Box^n$ are expected to isolate (so to speak) a class of conclusions from others. Thus,
the class should be immune from inconsistencies spreading but such is not the
case with $\Box^n(\varphi \to \beta)$: The non-modal logic must already be paraconsistent or
the modal context $\Box^n$ is not to govern a class of conclusions resisting deductive
breakdown. 

An implicit assumption throughout the text is that whenever the language of the
modal logic under consideration is not a classical one extended by
a modal operator, the obvious adjustments are made: E.g., replacing $A$ by
its intersection with the modal language when disjunction or some other usual
connective is missing. Similarly if there is more than one negation . . .
Abstract. This paper combines ideas from argumentation [1,8] with
desires and planning rules, in order to give a formal account of how con- 
sistent sets of intentions can be obtained from a conflicting set of desires. 
We show how conflicts may arise between desires and we resolve them. 
We argue that the set of desires can be clustered in three categories: i) 
the intentions of the agent, ii) the rejected desires and iii) the desires in 
abeyance. Finally, we show that the use of argumentation with desires is 
different from the usual kind of argumentation with default rules. 



1 Introduction 

An increasing number of software applications are being conceived, designed, 
and implemented using the notion of autonomous agents. These applications 
vary from email filtering, through electronic commerce, to large industrial appli- 
cations. In all of these disparate cases, however, the notion of autonomy is used 
to denote the fact that the software has the ability to decide for itself which 
goals it should adopt and how these goals should be achieved. 

Different architectures have emerged as candidates for studying these agent- 
based systems [2,3,4,6,12,10,11]. One of these architectures regards the system 
as a rational agent adopting certain mental attitudes: the beliefs (B), the desires 
(D) and the intentions (I). An agent can have contradictory desires. However, 
its intentions are a coherent subset of desires which the agent is committed to 
achieve. 

In [5], the authors explored principles governing rational balance among an 
agent’s beliefs, goals, actions and intentions. In [10] Rao and Georgeff showed 
how different rational agents can be modeled by imposing certain conditions on 
the persistence of an agent’s beliefs, desires or intentions (the BDI model). In 
decision theory, Pearl [9] illustrated how planning agents are provided with goals
- defined as desires together with commitments - and charged with the task of 
discovering (or performing) some sequence of actions to achieve those goals. 

Most formalizations are sophisticated enough to handle many aspects of
BDI agents. However, they do not show how an agent's intentions are calculated
from the whole set of its desires. In other words, it is not clear how an agent 
chooses a subset of its possibly contradictory desires. 
Inspired by work on argumentation theory, we present in this paper a
framework for handling contradictory desires. We will show that the problem 
can be formulated exactly as in argumentation theory and then we show that 
we can use all the techniques of argumentation to resolve it. 

We suppose that an agent is equipped with a base of desires, a base of plans 
to carry out in order to achieve the desires (we are not interested in the way in 
which these plans are generated), and finally a knowledge base. Then we show 
how conflicts can emerge between desires, how to formalize these conflicts and 
finally how to solve them to obtain the intentions of the agent. Let’s illustrate 
our purposes by the following example. 

Example 1. X is an agent who has the two following desires: 

1. To go on a journey to central Africa, (jca) 

2. To finish a publication before going on a journey, (fp) 

Let's suppose that the following plans to achieve the above desires are generated:

$t \wedge vac \to jca$
$w \to fp$
$ag \to t$
$fr \to t$
$hop \to vac$
$dr \to vac$

with: t = “to get the tickets”, vac = “to be vaccinated” , w = “to work”, ag = 
“to pass to the agency”, fr = “to have a friend who may bring the tickets”, hop 
= “to go to a hospital”, dr = “to go to a doctor”. 

For example, the rule $t \wedge vac \to jca$ means that to go on a journey to central
Africa, the agent X should get tickets and should be vaccinated. The rule $w \to fp$
expresses that the agent should work in order to finish his paper. To get tickets,
the agent can either go to an agency or ask a friend to get them.
Similarly, to be vaccinated, the agent has the choice between going to a doctor
or going to a hospital. In these last two cases, the agent has two plans for each
action.

Note that getting tickets and being vaccinated become two sub-desires of the 
agent with each one having its own plans. 

In addition to the set of plans and the set of desires, the agent may also have
another base containing its knowledge and some integrity constraints. In our
example, we have:

$w \to \neg ag$
$w \to \neg dr$

These two rules mean that if the agent works, he can neither go to an agency
nor go to a doctor. Consequently, the plans of its two initial desires are
conflicting.

Of course, it would be ideal if all the desires could become intentions. As our
example illustrates, this is not always the case. In this paper we will answer




554 L. Amgoud 



the following questions: which desire will become an intention of the agent, and 
with which plan? 

2 Argumentation Frameworks 

Argumentation is a reasoning model based on the construction of arguments and 
counter-arguments (or defeaters) followed by the selection of the most acceptable 
of them. 

In Dung’s work [7,8], an argumentation framework is defined as a pair con- 
sisting of a set of arguments and a binary relation representing the defeasibility 
relation between arguments. Here, an argument is an abstract entity whose role 
is only determined by its relation to other arguments; its structure and its
origin are not known.

Definition 1 ([8]). An argumentation framework is a pair $\langle \mathcal{A}, \mathcal{R} \rangle$ where
$\mathcal{A}$ is a set of arguments and $\mathcal{R}$ is a binary relation representing a defeasibility
relationship between arguments, i.e. $\mathcal{R} \subseteq \mathcal{A} \times \mathcal{A}$. $(A, B) \in \mathcal{R}$ or equivalently
"$A \,\mathcal{R}\, B$" means that the argument $A$ defeats the argument $B$. We also say that
$A$ and $B$ are in conflict.

An argumentation framework is finitary iff for each argument $A$ there are finitely
many arguments which defeat $A$.

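Definition 2. Let $\langle \mathcal{A}, \mathcal{R} \rangle$ be an argumentation framework, and $S \subseteq \mathcal{A}$.

- $S$ is conflict-free iff $\nexists A, B \in S$ such that $A \,\mathcal{R}\, B$.

- $S$ defends $A$ iff $\forall B \in \mathcal{A}$, if $B \,\mathcal{R}\, A$ then $\exists C \in S$ such that $C \,\mathcal{R}\, B$.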
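The two notions of Definition 2 translate directly into set operations. The following is a minimal sketch, assuming arguments are strings and the defeat relation is a set of ordered pairs. The function names and the three-argument example are illustrative and do not come from the text.

```python
def conflict_free(S, R):
    """No member of S defeats a member of S."""
    return not any((a, b) in R for a in S for b in S)

def defends(S, a, args, R):
    """Every defeater b of a is itself defeated by some member of S."""
    return all(any((c, b) in R for c in S)
               for b in args if (b, a) in R)

# Hypothetical framework: a defeats b, b defeats c.
args = {"a", "b", "c"}
R = {("a", "b"), ("b", "c")}

print(conflict_free({"a", "c"}, R))
print(conflict_free({"a", "b"}, R))
print(defends({"a"}, "c", args, R))
```

Note that an undefeated argument is defended by every set, including the empty one, because the universal condition is vacuously true.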

Each defeasibility relation leads to an argumentation framework. Defeating ar- 
guments can in turn be defeated by other arguments so we need to define a 
notion of the status of arguments. This notion of status is the central element 
of any argumentation framework. Its definition takes as input the set of all pos- 
sible arguments and their mutual relations of defeat, and produces as output a 
division of arguments into three classes of arguments: 

— The class of acceptable arguments. They represent the "good" arguments.
In the case of handling inconsistency in knowledge bases, for example, the 
formulas supported by such arguments will be inferred from the base. 

— The class of rejected arguments. They are those arguments defeated by ac- 
ceptable arguments. Such arguments would not be considered in the process 
of inference from a knowledge base, for example. 

— The arguments which are neither acceptable nor rejected are gathered in the 
so-called class of arguments in abeyance. 

Note that to define the rejected arguments and the arguments in abeyance of 
a given argumentation framework, we first need to determine the set of accept- 
able arguments of that framework. For that purpose Dung suggested several 
semantics: basic semantics and preferred semantics. 
Definition 3. Let $\langle \mathcal{A}, \mathcal{R} \rangle$ be an argumentation framework, and $S \subseteq \mathcal{A}$.

- $S$ is a preferred extension iff $S$ is maximal (for set inclusion), conflict-free
and it defends all its elements.

- $S$ is a basic extension iff $S$ is the least fixpoint of the function $\mathcal{F}(S) = \{A \in \mathcal{A} \mid A$ is defended by $S\}$.
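For finite frameworks both semantics can be computed by brute force. The sketch below iterates the function $\mathcal{F}$ from the empty set to reach the basic extension, and enumerates subsets to find the preferred extensions. It is a hedged illustration rather than an efficient algorithm, and all names and the two example frameworks are assumptions.

```python
from itertools import combinations

def conflict_free(S, R):
    return not any((a, b) in R for a in S for b in S)

def defended_by(S, args, R):
    """F(S): the arguments all of whose defeaters are defeated by S."""
    return {a for a in args
            if all(any((c, b) in R for c in S) for b in args if (b, a) in R)}

def basic_extension(args, R):
    """Least fixpoint of F, reached by iterating from the empty set."""
    S = set()
    while True:
        nxt = defended_by(S, args, R)
        if nxt == S:
            return S
        S = nxt

def preferred_extensions(args, R):
    """Maximal conflict-free sets that defend all their elements."""
    admissible = [set(c)
                  for k in range(len(args) + 1)
                  for c in combinations(sorted(args), k)
                  if conflict_free(c, R) and set(c) <= defended_by(set(c), args, R)]
    return [S for S in admissible if not any(S < T for T in admissible)]

chain = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})   # a defeats b, b defeats c
cycle = ({"a", "b"}, {("a", "b"), ("b", "a")})        # mutual defeat

print(basic_extension(*chain), preferred_extensions(*chain))
print(basic_extension(*cycle), preferred_extensions(*cycle))
```

The mutual-defeat framework shows how the two semantics differ: the basic extension is empty while there are two preferred extensions.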

3 Basic Definitions 

In this section, we present the basic notions of a framework for handling an 
agent’s desires: a logical language, a desire, an action, a realization tree of a 
desire and finally defeasibility between actions. 



3.1 Logical Language 

Let $\mathcal{L}$ be a propositional language, $\vdash$ denotes classical inference and $\equiv$ denotes
logical equivalence. Each agent is equipped with three bases $\langle \mathcal{D}, \mathcal{P}, \Sigma \rangle$ such
that:

- $\mathcal{D}$ contains formulas of $\mathcal{L}$. The elements of $\mathcal{D}$ represent the initial desires of
the agent. For example, an agent may have the following desires: to finish a
publication, to go to a dentist, etc. Note that the set $\mathcal{D}$ may be inconsistent.
This means that an agent is allowed to have contradictory desires.

- $\mathcal{P}$ is considered as a base of plans. It contains formulas having the form
$\varphi_1 \wedge \ldots \wedge \varphi_n \to h$ where $\varphi_1, \ldots, \varphi_n, h$ are literals of $\mathcal{L}$. Such a formula
means that to achieve $h$, the agent should realize $\varphi_1, \ldots, \varphi_n$.

- $\Sigma$ contains formulas of $\mathcal{L}$. They represent the knowledge of the agent and
some integrity constraints.



Example 2. In example 1, the agent has the following bases:

$\mathcal{D} = \{jca, fp\}$,

$\mathcal{P} = \{t \wedge vac \to jca,\; w \to fp,\; ag \to t,\; fr \to t,\; hop \to vac,\; dr \to vac\}$

and $\Sigma = \{w \to \neg ag,\; w \to \neg dr\}$.



3.2 The Notion of Desire 

A desire is either an element $h$ of $\mathcal{D}$ or a sub-desire of $h$. In example 1, the desire
of going on a journey to central Africa has two sub-desires which are:

— getting the tickets 

— being vaccinated 
Let's note that the sub-desires are not in the base $\mathcal{D}$. Formally:

Definition 4 (Desire). A desire is:

- a formula $h \in \mathcal{D}$.

- a formula $h$ such that $\varphi_1 \wedge \ldots \wedge \varphi_n \to h \in \mathcal{P} \cup \Sigma$.

- a formula $h$ such that $\exists\, \varphi_1 \wedge h \ldots \wedge \varphi_n \to h' \in \mathcal{P}$. In this case, $h$ is called
a sub-desire of $h'$. The function $Subdesire(h')$ returns the set $\{h', \varphi_1, \ldots, \varphi_n\}$
of the sub-desires of $h'$. Note that each desire is a sub-desire of itself.

Example 3. In example 1, the agent has two desires (jca, fp) and several
sub-desires like: w, t, vac, ag, fr, hop and dr.

3.3 The Notion of Action 

As noted above, a desire may have a plan to achieve it. We bring the two notions 
together in a new notion of action. 

Definition 5 (Action). An action is a pair $a = \langle h, H \rangle$ such that:

- $h$ is a desire.

- If $\exists\, \varphi_1 \wedge \ldots \wedge \varphi_n \to h \in \mathcal{P} \cup \Sigma$ then $H = \{\varphi_1, \ldots, \varphi_n\}$ else $H = \emptyset$.

The function $Desire(a) = h$ returns the desire of an action $a$ and the function
$Plan(a) = H$ returns its plan.

We denote by $\mathbb{A}$ the set of all actions which may be constructed from the triple
$\langle \mathcal{D}, \mathcal{P}, \Sigma \rangle$.

Remark 1 . — 

— If the set H (the plan) is empty, this means that the desire has not yet gotten 
a plan to achieve it or the desire is atomic which means that it does not need 
any plan to be achieved. 

— A desire may have several plans to be achieved. In this case, we have as 
many actions for that desire as plans. 

Example 4. In example 1, we have the following actions: $a_1 = \langle jca, \{t, vac\} \rangle$,
$a_2 = \langle fp, \{w\} \rangle$ and $a_3 = \langle w, \emptyset \rangle$. Action $a_3$ means that:

1 . The desire of working is atomic or, 

2 . The plan of working is not given yet. 

The realization of an action, i.e. the execution of its plan, requires in certain
situations the decomposition of this plan. Each element of the initial plan gives
rise to a new action with a new plan. We call this kind of action a sub-action.
Formally:

Definition 6 (Sub-action). Let $a_1$ and $a_2$ be two actions of $\mathbb{A}$. $a_1$ is a
sub-action of $a_2$ iff $Desire(a_1) \in Plan(a_2)$. In other words, $Desire(a_1)$ is a
sub-desire of $Desire(a_2)$.
Example 5. In example 1, the action $a_1 = \langle jca, \{t, vac\} \rangle$ has the following
sub-actions: $a_{11a} = \langle t, \{ag\} \rangle$, $a_{11b} = \langle t, \{fr\} \rangle$, $a_{12a} = \langle vac, \{dr\} \rangle$, and
$a_{12b} = \langle vac, \{hop\} \rangle$.
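The construction of actions and sub-actions can be sketched for the bases of example 1. The encoding below is an assumption made for illustration: each plan rule is a pair (body literals, head), actions are pairs (desire, frozenset of plan literals), and only the plan base is consulted when looking for plans, which makes no difference on this example. The function names are hypothetical.

```python
# Plan base and initial desires of example 1 (rule = (body, head)).
P = [(("t", "vac"), "jca"), (("w",), "fp"), (("ag",), "t"),
     (("fr",), "t"), (("hop",), "vac"), (("dr",), "vac")]
D = ["jca", "fp"]

def desires(D, P):
    """Initial desires plus, transitively, the literals of their plans."""
    found = list(D)
    for h in found:                      # the list grows while we scan it
        for body, head in P:
            if head == h:
                found.extend(lit for lit in body if lit not in found)
    return found

def actions(D, P):
    """One action per (desire, plan); an empty plan if no rule has that head."""
    acts = []
    for h in desires(D, P):
        plans = [frozenset(body) for body, head in P if head == h]
        acts.extend((h, H) for H in plans) if plans else acts.append((h, frozenset()))
    return acts

def sub_actions(a, acts):
    """Actions whose desire belongs to the plan of a (Definition 6)."""
    return [b for b in acts if b[0] in a[1]]

acts = actions(D, P)
a1 = ("jca", frozenset({"t", "vac"}))
for b in sub_actions(a1, acts):
    print(b[0], sorted(b[1]))
```

A desire with several plans, such as t, yields one action per plan, in line with Remark 1.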

An action may have consequences. 

Definition 7 (Action's consequences). Let $a \in \mathbb{A}$.

$Consequence(a) = \{\varphi_i$ such that $Desire(a) \cup Plan(a) \cup \Sigma \vdash \varphi_i\}$.

Example 6. In example 1, the action $a_2 = \langle fp, \{w\} \rangle$ has the following
consequences: $\{\neg ag, \neg dr, fp, w\}$.

3.4 Conflicting Actions 

After a small study of the various conflicts which may exist between
actions (desires), we think that there are four main families of conflicts. In fact,
two actions $a_1$ and $a_2$ may be conflicting for one of the following reasons:

- desire-desire conflict, i.e. $Desire(a_1) \cup Desire(a_2) \cup \Sigma \vdash \bot$.

- plan-plan conflict, i.e. $Plan(a_1) \cup Plan(a_2) \cup \Sigma \vdash \bot$.

- consequence-consequence conflict, i.e. $Consequence(a_1) \cup Consequence(a_2) \vdash \bot$.

- plan-consequence conflict, i.e. $Plan(a_1) \cup Consequence(a_2) \vdash \bot$ or
$Plan(a_2) \cup Consequence(a_1) \vdash \bot$.

Example 7. Let's consider an agent equipped with the following bases: $\mathcal{D} = \{fp, b, s\}$,
$\Sigma = \{b \to \neg ws\}$ and

$\mathcal{P} = \{ws \wedge wd \to fp,\; c \to b,\; \neg c \to s\}$

with: 

ws = “to work Saturday” 
wd = “to work Sunday” 
fp = “to finish the paper” 
c = “to take the car” 
b = “to go to the beach Saturday” 
s = “to make savings” 

$\mathbb{A}$ contains the following actions:

- $a_1 = \langle fp, \{ws, wd\} \rangle$.

- $a_2 = \langle b, \{c\} \rangle$.

- $a_3 = \langle s, \{\neg c\} \rangle$.

There exists a plan-plan conflict between $a_2$ and $a_3$. The two actions have
contradictory plans here.

The consequences of action $a_2$ are $Consequence(a_2) = \{b, \neg ws\}$. There
exists a plan-consequence conflict between $a_2$ and $a_1$. There also exists a
consequence-consequence conflict between $a_1$ and $a_2$. Contrary to the conflict
between $a_2$ and $a_3$, the plans of $a_1$ and $a_2$ seem compatible; the conflict emerges
from the consequences of the plans.
These different kinds of conflicts between actions are brought together in a
unique relation of conflict defined as follows:

Definition 8 (Conflict). Let $a_1$ and $a_2$ be two actions of $\mathbb{A}$. $a_1$ conflicts with
$a_2$ iff: $\{Desire(a_1), Desire(a_2)\} \cup Plan(a_1) \cup Plan(a_2) \cup \Sigma \vdash \bot$.

Example 8. In example 1, $a_{11a} = \langle t, \{ag\} \rangle$ conflicts with $a_2 = \langle fp, \{w\} \rangle$.
Indeed, $Plan(a_{11a}) \cup \Sigma \vdash \neg w$ and $Plan(a_2) = \{w\}$, thus $a_{11a}$ and $a_2$ are
incompatible.

Property 1 . The conflicts relation is symmetrical. However, it is neither reflexive 
nor transitive. 

Example 9. Let's consider an agent with a base of desires $\mathcal{D} = \{h_1, h_2, h_3\}$ and
a base of plans

$\mathcal{P} = \{a \to h_1,\; b \to h_2,\; \neg a \wedge \neg b \to h_3\}$

Let's suppose that $\Sigma = \emptyset$. $a_1$, $a_2$ and $a_3$ are actions such that: $a_1 = \langle h_1, \{a\} \rangle$,
$a_2 = \langle h_2, \{b\} \rangle$ and $a_3 = \langle h_3, \{\neg a, \neg b\} \rangle$.

$a_1$ conflicts with $a_3$ and $a_3$ conflicts with $a_2$. However, $a_1$ does not conflict with $a_2$.

Remark 2. An action may conflict with itself.

Let’s consider the following example: 

Example 10. Let $\mathcal{D} = \{\neg a'\}$, $\mathcal{P} = \{a \wedge b \wedge c \to \neg a'\}$ and $\Sigma = \{a \to a'\}$. The
action $\langle \neg a', \{a, b, c\} \rangle$ conflicts with itself.
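The conflict relation of Definition 8 can be checked mechanically when desires and plans are literals and the knowledge base consists of rules from a conjunction of literals to a literal, as in the examples. The sketch below decides classical inconsistency by enumerating truth assignments. The string encoding with `~` for negation and the function names are assumptions of this sketch.

```python
from itertools import product

def atom(lit):
    return lit.lstrip("~")

def holds(lit, val):
    """Truth of a literal under a valuation of atoms."""
    return val[atom(lit)] != lit.startswith("~")

def consistent(literals, rules):
    """True iff some valuation satisfies all literals and all rules."""
    atoms = sorted({atom(l) for l in literals} |
                   {atom(l) for body, head in rules for l in (*body, head)})
    for bits in product([False, True], repeat=len(atoms)):
        val = dict(zip(atoms, bits))
        if all(holds(l, val) for l in literals) and all(
                holds(head, val) or not all(holds(b, val) for b in body)
                for body, head in rules):
            return True
    return False

def conflicts(a1, a2, sigma):
    """Definition 8: desires and plans of both actions are inconsistent with sigma."""
    (h1, H1), (h2, H2) = a1, a2
    return not consistent({h1, h2} | set(H1) | set(H2), sigma)

# Example 8: sigma = {w -> ~ag, w -> ~dr}
sigma1 = [(("w",), "~ag"), (("w",), "~dr")]
a11a, a2 = ("t", {"ag"}), ("fp", {"w"})
a1 = ("jca", {"t", "vac"})
print(conflicts(a11a, a2, sigma1), conflicts(a1, a2, sigma1))

# Example 10: an action conflicting with itself, sigma = {a -> a'}
act = ("~a'", {"a", "b", "c"})
print(conflicts(act, act, [(("a",), "a'")]))
```

Because the test is symmetric in its two arguments, the symmetry stated in Property 1 holds by construction.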



3.5 Tree of Realization 

In example 1, the action $a_2$ does not attack action $a_1$. However, it attacks one
of its sub-actions, namely $a_{11a}$. In this case, we cannot say that the two desires
$Desire(a_1)$ and $Desire(a_2)$ are realizable at the same time.

Thus, to check if a given desire is realizable, the corresponding action and 
all its sub-actions must be taken into account. It is then necessary to introduce 
a new notion of a tree of realization of a given desire. 

A tree of realization of a given desire d is an AND tree. Its nodes are actions 
and its arcs represent the sub-action relationship. The root of the tree is an 
action for the desire d. 

It is an AND tree because all the sub-actions of a given action must be carried 
out. When for the same desire, there are several plans to carry it out, only one 
plan is considered. Formally: 

Definition 9. A tree of realization $g$ of a desire $h$ is a finite tree such that:

- $\langle h, H \rangle$ is the root of the tree.

- A node $\langle h', \{\varphi_1, \ldots, \varphi_n\} \rangle$ has exactly $n$ children $\langle \varphi_1, H'_1 \rangle$, ..., $\langle \varphi_n, H'_n \rangle$.

- The leaves of the tree are atomic actions.

The function $Root(g) = h$ returns the desire of the root. The function $Nodes(g)$
returns the set of all the actions of the tree $g$.

$\mathcal{G}(\mathbb{A})$ denotes the set of all the trees of realization that we can build from the set
$\mathbb{A}$.

Remark 3. By analogy with argumentation theory, a realization tree can be
compared to an argument. A realization tree is built to achieve a given desire whereas
an argument is built to support a given conclusion. A conclusion may be sup- 
ported by several arguments and similarly, a desire can have several trees of 
realization. This case arises when one of its sub-desires has several plans. 

Example 11. In example 1, the desire "jca" whose action is $a_1 = \langle jca, \{t, vac\} \rangle$
has four trees of realization as shown in Fig. 1.



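[Figure: four trees, each with root $A_1 = \langle jca, \{t, vac\} \rangle$ and the following pair of children]

Tree 1: $A_{11a} = \langle t, \{ag\} \rangle$ and $A_{12a} = \langle vac, \{dr\} \rangle$
Tree 2: $A_{11a} = \langle t, \{ag\} \rangle$ and $A_{12b} = \langle vac, \{hop\} \rangle$
Tree 3: $A_{11b} = \langle t, \{fr\} \rangle$ and $A_{12a} = \langle vac, \{dr\} \rangle$
Tree 4: $A_{11b} = \langle t, \{fr\} \rangle$ and $A_{12b} = \langle vac, \{hop\} \rangle$

Fig. 1. Trees of realization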
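The trees of realization of a desire can be enumerated by choosing, at every node, one plan for each desire and combining the choices. The following sketch does this for the plan base of example 1. The nested-tuple encoding (desire, plan, children) and the function names are assumptions, and the recursion presumes that the plan base is acyclic.

```python
from itertools import product

# Plan base of example 1 (rule = (body, head)).
P = [(("t", "vac"), "jca"), (("w",), "fp"), (("ag",), "t"),
     (("fr",), "t"), (("hop",), "vac"), (("dr",), "vac")]

def trees(h, P):
    """All realization trees rooted at desire h (assumes P is acyclic)."""
    plans = [body for body, head in P if head == h]
    if not plans:                        # atomic action: a leaf
        return [(h, (), ())]
    result = []
    for body in plans:                   # only one plan is used per tree
        for children in product(*(trees(lit, P) for lit in body)):
            result.append((h, body, children))
    return result

def nodes(tree):
    """The set of actions (desire, plan) occurring in a tree."""
    h, body, children = tree
    out = {(h, frozenset(body))}
    for child in children:
        out |= nodes(child)
    return out

jca_trees = trees("jca", P)
print(len(jca_trees), len(trees("fp", P)))
```

The AND nature of the tree shows in the use of the Cartesian product: every element of the chosen plan must receive a subtree.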



Since the actions may be conflicting, two realization trees may then be con- 
flicting too. That relation between realization trees will be called “Attack” rela- 
tion. 

Definition 10 (Attack). Let $g_1, g_2 \in \mathcal{G}(\mathbb{A})$. $g_1$ attacks $g_2$ iff $\exists a_1 \in Nodes(g_1)$
and $\exists a_2 \in Nodes(g_2)$ such that $a_1$ conflicts with $a_2$.

Property 2. The relation attack is symmetrical. 

Let’s now consider the following example: 



^ If a desire has several plans to carry it out, only one is considered in a tree. 
Example 12. Let's suppose an agent equipped with the following bases: $\mathcal{D} = \{c\}$,
$\Sigma = \emptyset$ and $\mathcal{P} = \{a \wedge b \to c,\; e \to a,\; d \wedge g \to b,\; \neg e \to g,\; x \to d\}$.

The realization tree of the desire $c$ contains the following actions: $\langle c, \{a, b\} \rangle$,
$\langle a, \{e\} \rangle$, $\langle b, \{d, g\} \rangle$, $\langle g, \{\neg e\} \rangle$, $\langle d, \{x\} \rangle$, $\langle e, \emptyset \rangle$, $\langle \neg e, \emptyset \rangle$ and
$\langle x, \emptyset \rangle$.

It is clear that $\langle a, \{e\} \rangle$ conflicts with $\langle g, \{\neg e\} \rangle$.

This example shows clearly that a realization tree of a desire can attack itself.
Such trees are called self-attacked realization trees.

They can be compared to self-defeating arguments. Just as conclusions
supported only by self-defeating arguments cannot be inferred from a knowledge
base (for example), it is obvious that a desire whose trees of realization are all
inconsistent is a rejected one. This means it is impossible to carry out such a desire.



4 A Formal System for Handling Desires 

From the preceding definitions, we can now introduce a formal system for han- 
dling conflicting desires of an agent. 

Definition 11 (System for handling desires). Let's consider a triple $\langle \mathcal{D}, \mathcal{P}, \Sigma \rangle$.

A system for handling desires (SHD) is a pair $\langle \mathcal{G}(\mathbb{A}), Attack \rangle$ such that
$\mathcal{G}(\mathbb{A})$ is a set of realization trees and $Attack$ is a binary relation representing the
defeasibility relation between the realization trees ($Attack \subseteq \mathcal{G}(\mathbb{A}) \times \mathcal{G}(\mathbb{A})$).

As in argumentation theory, we partition the set $\mathcal{G}(\mathbb{A})$ into three categories of
realization trees:

- The class of acceptable realization trees. They represent the good plans to
achieve their corresponding desires. Those desires will become the intentions
of the agent.

— The class of rejected realization trees. They are those attacked by acceptable 
realization trees. 

— The class of realization trees in abeyance which gathers the realization trees 
which are neither acceptable nor rejected. 

In what follows we give the different semantics of "acceptable realization trees".
We first start by giving new definitions of conflict-free and defence in the context 
of conflicting desires. 

Definition 12 (Conflict-free). Let $\langle \mathcal{G}(\mathbb{A}), Attack \rangle$ be a SHD and $S \subseteq \mathcal{G}(\mathbb{A})$.

$S$ is conflict-free iff $\nexists g_1$ and $g_2$ in $S$ such that $g_1$ attacks $g_2$. It follows that if
$S$ is conflict-free then $\forall g_i \in S$, $g_i$ is not self-attacked.

As we can see, the notion of conflict-free is very similar to the one in argumenta- 
tion theory. However, it is not the case for the notion of defence. The semantics 
of a realization tree is a complete plan to achieve a desire and our aim is to 
achieve a maximum of desires. The idea is that if a given desire $d_1$ can be achieved
with a plan $g_1$ then, if another plan $g_2$ for the same desire attacks a plan $g_3$ of
another desire $d_2$, we will accept $g_3$ to enable the agent to achieve its two desires.
Formally:

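Definition 13 (Defence). Let $\langle \mathcal{G}(\mathbb{A}), Attack \rangle$ be a SHD, $S \subseteq \mathcal{G}(\mathbb{A})$ and
$g \in \mathcal{G}(\mathbb{A})$. $S$ defends $g$ iff $\forall g' \in \mathcal{G}(\mathbb{A})$ s.t. $g'$ attacks $g$, $\exists g'' \in S$ s.t.
$Desire(g'') = Desire(g')$.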



Preferred Extensions 

Using the new definition of defence, we show the following property: 

Property 3. Each SHD has at least one preferred extension. 

Let $S_1, \ldots, S_n$ be the different preferred extensions of a SHD. For a given set of
trees, the function $Desires(S_i) = \{h \in \mathcal{D}$ such that $\exists g \in S_i$ and $Root(g) = h\}$
returns the different desires of these trees.

Property 4. Let $\langle \mathcal{G}(\mathbb{A}), Attack \rangle$ be a SHD and $S_1, \ldots, S_n$ the corresponding
extensions. The sets $Desires(S_1), \ldots, Desires(S_n)$ are not always maximal (for
set inclusion).

Let's consider the following example:

Example 13. In example 1, we can construct 5 realization trees. Four of them
correspond to the desire jca. Let's note them $g_1$, $g_2$, $g_3$, $g_4$. The fifth one
corresponds to the desire fp and it is denoted by $g_5$.

In this case, we have exactly two extensions:

- $S_1 = \{g_1, g_2, g_3, g_4\}$ with $Desires(S_1) = \{jca\}$

- $S_2 = \{g_4, g_5\}$ with $Desires(S_2) = \{jca, fp\}$

Note that $Desires(S_1) \subseteq Desires(S_2)$.

The goal is to carry out the maximum of desires of the agent. In the preceding
example, the agent can carry out its two desires in extension $S_2$. Thus among
all the preferred extensions, we keep only those which carry out a maximum of
desires (for set inclusion). Let us denote them by $S_1, \ldots, S_j$.

Example 14. In example 1, the two desires of the agent become intentions.
However, to achieve these intentions, the agent should use the plans of the trees
$\{g_4, g_5\}$.

Preferred extensions give us different sets of desires which may be achieved 
together. There is no preference between the extensions. 



Basic Extension 

The basic extension characterises the set of acceptable realization trees by a
function $\mathcal{F}$ that returns for each set of realization trees the ones that are defended
by that set.

Theorem 1.

- Let $\langle \mathcal{G}(H), Attack \rangle$ be a SHD. The function $\mathcal{F}$ is monotonic (for set
inclusion). In other terms, if $S, S' \subseteq \mathcal{G}(H)$ are such that $S \subseteq S'$, then
$\mathcal{F}(S) \subseteq \mathcal{F}(S')$.

- The function $\mathcal{F}$ has a least fixpoint $S$.

- $S$ is the basic extension of the SHD.

When the system SHD is finitary (each realization tree is attacked by a finite
number of realization trees), the function $\mathcal{F}$ is continuous. In that case, the least
fixpoint of $\mathcal{F}$ can be obtained by iterative applications of the function $\mathcal{F}$ to the
empty set.
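
For a finite system, the iteration from the empty set can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tree names, the attack relation and the function names are hypothetical, and defence is read as in Definition 13 (every attacker of a tree shares its desire with some tree of the defending set).

```python
def defends(S, g, attacks, desire):
    """Definition 13: every attacker of g shares its desire with some tree in S."""
    desires_in_S = {desire[t] for t in S}
    return all(desire[a] in desires_in_S for (a, target) in attacks if target == g)

def characteristic(S, trees, attacks, desire):
    """Return the trees defended by S."""
    return {g for g in trees if defends(S, g, attacks, desire)}

def least_fixpoint(trees, attacks, desire):
    """Iterate from the empty set until nothing changes (finite systems only)."""
    S = set()
    while True:
        nxt = characteristic(S, trees, attacks, desire)
        if nxt == S:
            return S
        S = nxt

# Hypothetical system: two plans for desire "x", one plan for desire "y".
trees = {"a1", "a2", "b1"}
desire = {"a1": "x", "a2": "x", "b1": "y"}
attacks = {("a1", "b1"), ("b1", "a1")}  # a1 and b1 attack each other

basic = least_fixpoint(trees, attacks, desire)
```

In this hypothetical system the iteration passes through {a2} and {a2, b1} before stabilising on all three trees, so the resulting set contains the mutually attacking trees a1 and b1.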

Theorem 2. Let $\langle \mathcal{G}(H), Attack \rangle$ be a SHD.

- $\langle \mathcal{G}(H), Attack \rangle$ is finitary.

- $\mathcal{F}$ is continuous.

- The least fixpoint of $\mathcal{F}$ is: $S = \bigcup_{i \geq 1} \mathcal{F}^i(\emptyset) = C \cup \left[\bigcup_{i \geq 1} \mathcal{F}^i(C)\right]$, where $C$ denotes the
set of non-attacked realization trees.

Unlike in argumentation theory, the function $\mathcal{F}$ does not preserve the conflict-free
property. In other words, if $S$ is conflict-free, $\mathcal{F}(S)$ is not always conflict-free.
This phenomenon occurs in a very particular case, when two defeasible trees,
say $g_1$ and $g_2$, are both other alternatives of their corresponding desires. In
this case, the second alternatives are both in the set $S$.

Theorem 3.

- The set $S^* = S \setminus \{g, g' \in S \text{ s.t. } g \text{ attacks } g'\}$ is conflict-free.

- The intentions of the agent are gathered in the set $I = \{Root(g_i) \mid g_i \in S^*\}$.

5 Conclusion 

This paper introduces a formal model for computing the intentions of an agent 
from its set of possibly contradictory desires. We showed how and when desires 
can conflict. We formalized this notion of conflict and then showed how to
resolve such conflicts, drawing on work in argumentation theory.

More work, of course, remains to be done in this area. Particularly, we shall 
study the meaning of other argumentation semantics, such as stable and
complete extensions, in our context. In [1], the authors have proposed a proof
theory testing whether a given argument is acceptable (using the basic seman- 
tics). We shall use those results to test if a given realization tree is acceptable 
and consequently its desire is an intention. 

We shall also enrich the model by introducing preferences between the desires. 
This will help to refine the classification of the desires by leaving fewer desires in
abeyance. We can imagine two sources of preferences. The first one is the agent
itself. This means that an agent can have preferences over its set of desires $\mathcal{D}$.
In this case, if there is a conflict between two actions, we keep the one whose
desire is most preferred by the agent. The second source of preferences is
argumentation. In this case, two actions $a_1$ and $a_2$ can be in conflict, but one of
them can have a good reason (argument) to be carried out. We are currently
investigating these matters.

Abstract. A sound and complete sequent calculus for skeptical consequence
in predicate default logic is presented. While skeptical consequence
is decidable in the finite propositional case, the move to predicate
or infinite theories increases the complexity of skeptical reasoning to
being $\Pi^1_1$-complete. This implies the need for sequent rules with countably
many premises, and such rules are employed.



1 Introduction 

Skeptical consequence is a notion common to all forms of nonmonotonic reason- 
ing. Every nonmonotonic formalism permits different world views to be justified 
using the same set of facts and principles; the skeptical consequences of a frame- 
work are the notions common to all world views associated with that framework. 
Our purpose in this paper is to present a Gentzen-style sequent calculus (incor- 
porating some infinitary rules) which will allow us to deduce the skeptical con- 
sequences of a given framework. Such sequent calculi (with purely finite rules) 
were defined for several types of nonmonotonic systems by Bonatti and Olivetti 
in [1], but they restricted their attention to finite propositional systems for which 
skeptical consequence is decidable. We will adapt and extend their systems to 
accommodate infinite predicate systems. 

We will focus on default logic (Reiter, [2]) in this extended abstract, but there 
are also versions for stable model logic programming (Gelfond and Lifschitz, [3]), 
and autoepistemic logic (Moore, [4]) in the full version of this paper. In all three 
cases, when one steps from the finite and propositional to the predicate and
potentially infinite, finding the set of skeptical consequences of a framework goes
from being decidable to being $\Pi^1_1$-complete, at the same level of the computability
hierarchy as true arithmetic. This result was proved for stable model logic
programming by Marek, Nerode, and Remmel in [5], but it translates to the 
other systems quite easily. 

Members of $\Pi^1_1$ sets correspond to finite-path computable (or $\Pi^0_1$) subtrees
of $\omega^{<\omega}$ in a very natural way. See [6] for an excellent exposition. This makes
skeptical consequence a natural fit for sequent calculi with infinitary rules, since a
sequent proof is, at its core, a finite-path tree. Bonatti and Olivetti also addressed
credulous consequence (“Can this notion be a part of some world view?”) in their
paper, but in the cases they were interested in, this question was also decidable.
In our more general context, credulous reasoning is $\Sigma^1_1$-complete, not a natural
type of question to address with trees-as-proofs. (One could write a sequent
calculus for credulous reasoning in $\Pi^1_2$ logic, but this would take us too far
afield.)

* This paper grew out of the author’s dissertation, written under the direction of Anil
Nerode.



In default logic, there has been debate since Reiter first defined the frame- 
work in 1980 about how to treat unbounded variables in the rules. Reiter ([2]) 
advocated treating open variables (at least in the negative premises of a default 
rule) in the same way that they are treated in logic programs: as abbreviations 
for the same rule with each possible ground term of the language substituted. 
This has the result of turning a finite default theory into an infinite grounded 
theory; this is the definition we will use in this paper. However, there is some 
quite justified criticism of this approach. Under these definitions, the default
theory $\left(\frac{: MP(x)}{P(x)}, \neg P(a)\right)$ does not imply $(\forall x)[P(x) \Leftrightarrow x \neq a]$. Lifschitz, in
[7], defines extensions (the possible world views associated with default logic)
relative to fixed domains. For finite theories and finite domains, everything is
decidable, but over infinite domains this is no longer the case. In [8], it is shown
that skeptical reasoning over countable domains using Lifschitz’ definition of
extension is $\Pi^1_2$-complete, the same level as for predicate circumscription and thus
beyond the scope of this paper. 

Because nonmonotonic logics deal not only with proof but with lack of proof, 
we will need not only standard monotone sequent calculi, but also rule systems 
for showing a lack of proof. We will call these antisequent calculi, using the 
terminology of Bonatti from [9]. While propositional provability and lack of 
provability are decidable, predicate provability is recursively enumerable, and 
hence predicate nonprovability is co-r.e. Just as $\Sigma^1_1$ sets do not lend themselves
naturally to tree-based proofs, neither do co-r.e. sets. However, while $\Sigma^1_1$ sets
required a jump to $\Pi^1_2$ logic, co-r.e. sets are easily accommodated in the $\Pi^1_1$
framework within which we will already be working. Bringing such enormously
powerful logical machinery to bear on such a relatively simple problem may seem 
like overkill, but it works out quite naturally. The reader is thus warned that 
infinitary proofs will appear throughout the paper, even when talking about 
something as simple as lack of a standard predicate logic proof. 



2 Default Logic Preliminaries 

We will assume that the reader has some familiarity with classical propositional 
and predicate logic, including the notion of a Herbrand base for a given predicate 
language and the standard sequent calculus LK for predicate logic. 

Default logic is one of the most intuitive and widely-studied nonmonotonic 
formalisms. Default logics are built upon classical propositional or predicate 







R.S. Milnikel 



logic, and we will assume the reader is familiar with the languages and the 
basics of axiomatic treatments of these logics. 

Definition 1 (Default).

Let $\mathcal{L}$ be a predicate language. A default is a triple $(\varphi, \Psi, \theta)$ where $\varphi$ and $\theta$ are
formulas from $\mathcal{L}$ and $\Psi = \{\psi_1, \ldots, \psi_n\}$ is a finite set of formulas from $\mathcal{L}$. A
default is usually written

$$\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta}$$

with the intended interpretation “if $\varphi$ is true and each $\psi_i$ is possible, conclude
$\theta$.”

The formula $\varphi$ will be called the prerequisite of the default, $\Psi$ the justifications,
and $\theta$ the conclusion. If a default contains formulas with unbounded
variables, we will refer to the default as open. A default that is not open is
closed.

A default theory is a pair $(D, W)$, where $D$ is a set of defaults and $W$ is a
set of formulas of $\mathcal{L}$. Note that we do not restrict $D$ or $W$ to being finite. If
both $D$ and $W$ are finite, we will refer to $(D, W)$ as a finite default theory. If $D$
contains open defaults, we will call $(D, W)$ an open default theory. If $D$ consists
entirely of closed defaults, we will call $(D, W)$ closed.



We will not want to work directly with default rules containing open formulas,
but will look at a default which includes an open formula among its prerequisite,
justifications, and conclusion as an abbreviation for the set of groundings of that
formula.

Let $d = \frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta}$ be an open default. The grounding of $d$, denoted
ground($d$), will be the set of closed defaults obtained by replacing each
unbounded variable $x$ occurring in $d$ by some ground term $t$ of $\mathcal{L}$, and doing so
uniformly throughout the default. (If $d$ is not open, let ground($d$) = $\{d\}$.) Let us
define the grounding of a set $D$ of defaults to be the set of all defaults occurring in
the grounding of some default in $D$. (So ground($D$) = $\bigcup \{\text{ground}(d) \mid d \in D\}$.) We
will occasionally abuse notation by writing ground($(D, W)$) for (ground($D$), $W$).
Note that $D$ = ground($D$) if $D$ is closed. One effect of grounding is that if $\mathcal{L}$
contains an infinite number of ground terms, then even when $D$ is finite, ground($D$)
might be infinite.

It will simplify work we will do later to look at sets which contain both
defaults and formulas. We can define an extended language as the union of all
formulas in $\mathcal{L}$ with the collection of all defaults constructed from formulas of $\mathcal{L}$.
This will allow us to express the default theory $(D, W)$ as the single set $D \cup W$
in this extended language.



Example 1. Let $\mathcal{L}$ be a language with equality, one constant 0, one unary function
$S$, and a unary relation $A$. Let us examine the default theory $(D, W)$
where $W$ is a classical theory with axioms for equality and the sentences
$\forall x, y (A(x) \wedge A(y) \rightarrow x = y)$ and $\forall x (x \neq Sx)$. Let $D = \left\{ \frac{: MA(x)}{A(x)} \right\}$. The
classical part insists upon an infinite domain and says “$A(x)$ holds of at most
one $x$,” while the open default chooses an $x$ at random for which $A(x)$ will hold.

The grounding of $D$ is, of course,

$$\left\{ \frac{: MA(0)}{A(0)},\ \frac{: MA(S0)}{A(S0)},\ \frac{: MA(SS0)}{A(SS0)},\ \frac{: MA(SSS0)}{A(SSS0)},\ \ldots \right\}.$$
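
Because the grounding is infinite, it can only be enumerated lazily. The sketch below does this for the open default of Example 1; the string encoding of defaults (a triple of format strings with `{x}` marking the free variable) and the function names are assumptions made for the illustration.

```python
from itertools import count, islice

def ground_terms():
    """Yield the ground terms 0, S0, SS0, ... of Example 1's language."""
    for n in count():
        yield "S" * n + "0"

def ground(default):
    """Lazily yield the closed instances of an open default.

    A default is a (prerequisite, justifications, conclusion) triple of
    format strings in which "{x}" marks the single free variable.
    """
    prerequisite, justifications, conclusion = default
    for t in ground_terms():
        yield (
            prerequisite.format(x=t),
            tuple(j.format(x=t) for j in justifications),
            conclusion.format(x=t),
        )

open_default = ("", ("A({x})",), "A({x})")  # the default  : MA(x) / A(x)
first_three = list(islice(ground(open_default), 3))
```

Taking a finite prefix with `islice` is the only safe way to inspect the result, since the generator never terminates.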



Default rules without justifications $\left(\frac{\varphi :}{\theta}\right)$ are monotonic and classical in nature.
It should not cause confusion if we conflate these with classical rules
of inference of the form $\frac{\varphi}{\theta}$. When such rules arise from considering a default
$\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta}$ in a context in which the $\psi_i$s are guaranteed to be possible,
we will call them residues. Analogously to the extended language of formulas and
defaults, we will define a residue language as the
formulas of $\mathcal{L}$ taken together with residues built from $\mathcal{L}$.

We will say that the closure of a set $\Gamma$ of formulas and residues, denoted Cl($\Gamma$), is the least set
$T$ of formulas of $\mathcal{L}$ which is deductively closed, satisfies $\Gamma \cap \mathcal{L} \subseteq T$, and has the additional
property that if $\frac{\varphi}{\theta} \in \Gamma$ and $\varphi \in T$ then $\theta \in T$. We are now in a position to define
reducts of default theories and extensions for default theories.



Definition 2 (Reduct of a Default Theory).

Let $S$ be a set of formulas of $\mathcal{L}$.

1. A default $\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta}$ is irrelevant with respect to $S$ if $\neg\psi_i \in S$ for at least one $i$.

2. Let $\Gamma$ be a closed default theory (in the extended language of formulas and
defaults). The reduct of $\Gamma$ with respect to $S$, denoted by $\Gamma_S$, is obtained from $\Gamma$ by:

a) Removing all defaults that are irrelevant with respect to $S$.

b) Replacing each remaining default $\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta}$ with its residue $\frac{\varphi}{\theta}$.

What remains after taking the reduct $\Gamma_S$ of a default theory $\Gamma$ is a residue
theory.

Definition 3 (Default Extension).

Let $\Gamma = D \cup W$ be a closed default theory. We say that a set of formulas $S$ of $\mathcal{L}$
is a default extension for $\Gamma$ if $S = Cl(\Gamma_S)$. We will say that $S$ is an extension
of open default theory $\Gamma$ if $S$ is an extension of ground($\Gamma$).
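
The two definitions can be illustrated on a drastically simplified fragment in which every formula is a literal and deductive closure is replaced by forward chaining over residues. This is an assumption made only to keep the sketch short; it does not capture classical consequence, and the theory and all names below are hypothetical.

```python
def neg(lit):
    """Negate a literal written as "p" or "~p"."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def reduct(defaults, S):
    """Drop defaults irrelevant w.r.t. S and keep the residues of the rest."""
    return [
        (pre, concl)
        for (pre, justs, concl) in defaults
        if not any(neg(j) in S for j in justs)  # irrelevant if some ~psi_i is in S
    ]

def closure(facts, residues):
    """Least set containing the facts and closed under the residues."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for pre, concl in residues:
            if (pre is None or pre in closed) and concl not in closed:
                closed.add(concl)
                changed = True
    return closed

def is_extension(defaults, W, S):
    """Definition 3, literal-only version: S equals the closure of the reduct."""
    return set(S) == closure(W, reduct(defaults, S))

# Hypothetical closed theory: defaults  : Mp / p  and  : Mq / q,  with W = {~q}.
D = [(None, ("p",), "p"), (None, ("q",), "q")]
W = {"~q"}
```

With this theory, `is_extension(D, W, {"~q", "p"})` holds, while `{"~q"}` alone is rejected because the residue of the first default still fires.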

Example 2. Let us reexamine Example 1 in light of these two definitions. Let us
choose a fixed $n$ and let $S = Th(W \cup \{A(\underline{n})\})$. (We use the convention that $\underline{n}$ is
$n$ $S$s followed by a 0.) From $W$ and $A(\underline{n})$, we can prove $\neg A(\underline{m})$ for all $m \neq n$,
so the only default which is relevant in context $S$ is $\frac{: MA(\underline{n})}{A(\underline{n})}$. Its residue is $\frac{\ }{A(\underline{n})}$,
and clearly $Th(W \cup \{A(\underline{n})\}) = Cl\left(W \cup \left\{\frac{\ }{A(\underline{n})}\right\}\right)$. Thus the
extensions of $(D, W)$ will be $\{Th(W \cup \{A(\underline{n})\}) \mid n \in \omega\}$. (If $A$ fails for all $x$, then
all defaults would be relevant, and $A(x)$ would be derivable for all $x$, leading to
a contradiction.)



Definition 4 (Skeptical Consequence).

A formula $\varphi$ in predicate language $\mathcal{L}$ is in the set of skeptical consequences of
a default theory $(D, W)$ if for every extension $S$ of $(D, W)$, $\varphi \in S$.

Example 3. We will continue to exploit Example 1. We saw in Example 2 that the
extensions of $(D, W)$ are $\{Th(W \cup \{A(\underline{n})\}) \mid n \in \omega\}$. The only items common to
all of these are the consequences of $W$ along with the sentence $\exists x A(x)$, making
$Th(W \cup \{\exists x A(x)\})$ the set of skeptical consequences of $(D, W)$. There is no
compactness theorem for skeptical consequence. By that we mean that although
in each particular extension $S$ of $(D, W)$, some finite portion of the reduct of the
theory was used to prove $\exists x A(x)$, there is no finite portion of ground($D$) which
is responsible for $\exists x A(x)$ being present in all extensions.
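
The intersection in Definition 4 can be pictured on a finite truncation of Example 1. The sketch below is only an illustration under strong assumptions: each extension is represented by a small hand-picked sample of its sentences (written as strings), restricted to the first few numerals, and no theorem proving is involved. All names are hypothetical.

```python
def numeral(n):
    """The term consisting of n S's followed by a 0."""
    return "S" * n + "0"

def extension_sample(n, bound):
    """A finite sample of sentences from the n-th extension, for numerals below `bound`."""
    sample = {"exists x. A(x)", f"A({numeral(n)})"}
    sample |= {f"~A({numeral(m)})" for m in range(bound) if m != n}
    return sample

def skeptical(extensions):
    """Sentences common to all the given extensions."""
    extensions = list(extensions)
    if not extensions:
        # With no extensions every formula is vacuously a skeptical consequence;
        # that case cannot be represented as a finite set, so it is rejected.
        raise ValueError("no extensions given")
    return set.intersection(*extensions)

bound = 4
common = skeptical(extension_sample(n, bound) for n in range(bound))
```

Among the sampled sentences, only the existential statement survives the intersection: each atom $A(\underline{n})$ is contradicted in every other extension.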



3 Skeptical Sequent Calculus for Default Logic 

Throughout this section, we will be working in a predicate language $\mathcal{L}$ without
equality. (Our running example uses equality, but our use of equality is quite 
limited and could be accommodated by a defined equivalence relation.) 

The definition of an extension insists that certain formulas be derivable and 
that others not be derivable (to keep the defaults used in the derivations rele- 
vant). Thus, as we accumulate information about potential extensions by back- 
tracking through a sequent proof, we will find that we need to establish, at var- 
ious points, both derivability and non-derivability. Because of the compactness 
of classical logic, establishing derivability will be straightforward. Establishing 
non-derivability, on the other hand, will be quite complicated. However, when we 
limit ourselves to finite sets of premises, non-derivability is fairly straightforward 
to establish. 



3.1 An Antisequent Calculus for Predicate Logic 

Bonatti in [9] presented an antisequent calculus for propositional logic, and our 
antisequent calculus will be the one of that paper extended by four rules. The 
four rules we add to Bonatti’s formulation are counterparts of the four rules 
for quantifiers from Gentzen’s LK. We assume that the reader is familiar with 
Gentzen’s sequent calculus LK, and will here list only the rules pertaining to 
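quantifiers. In the rules $(\vdash \forall)$ and $(\exists \vdash)$, we must insist that $x$ not appear as a free
variable in $\Gamma$ or in $\Delta$. (This formulation is drawn from [10].)

Table 1. Quantifier Rules from the Sequent Calculus LK

$$(\forall \vdash)\ \frac{\Gamma, \varphi(t) \vdash \Delta}{\Gamma, (\forall x)\varphi(x) \vdash \Delta} \qquad (\vdash \forall)\ \frac{\Gamma \vdash \Delta, \varphi(x)}{\Gamma \vdash \Delta, (\forall x)\varphi(x)}$$

$$(\exists \vdash)\ \frac{\Gamma, \varphi(x) \vdash \Delta}{\Gamma, (\exists x)\varphi(x) \vdash \Delta} \qquad (\vdash \exists)\ \frac{\Gamma \vdash \Delta, \varphi(t)}{\Gamma \vdash \Delta, (\exists x)\varphi(x)}$$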



Because of the way extensions of open default theories are defined, the general
case of skeptical default reasoning for even finite predicate default theories will
necessitate infinitary sequent rules. For this reason, we will not hesitate to bring
the enormous $\Pi^1_1$ power of infinitary sequent rules to bear on our comparatively
simple $\Pi^0_1$ problem of showing that there is no proof in the sequent calculus LK
of the sequent $\Gamma \vdash \Delta$.

An antisequent is a pair $(\Gamma, \Delta)$ of finite sets of formulas, denoted $\Gamma \nvdash \Delta$. We
will call $\Gamma \nvdash \Delta$ true if there is a model of $\Gamma$ in which all of the formulas of $\Delta$
are false. We have the benefit of the Soundness and Completeness Theorems for
LK, which tell us that $\Gamma \nvdash \Delta$ is true if and only if $\Gamma \vdash \Delta$ is false if and only if
$\Gamma \vdash \Delta$ is not derivable in LK.

An antisequent $\Gamma \nvdash \Delta$ will be considered an axiom of our antisequent calculus
if $\Gamma \cup \Delta$ consists entirely of atomic formulas and $\Gamma \cap \Delta = \emptyset$. The rules for the
antisequent calculus can be found in Table 2.

The usual proviso that $x$ may not be free in $\Gamma \cup \Delta$ applies to $(\nvdash \forall)$ and $(\exists \nvdash)$.

Showing that the antisequent calculus is sound is the counterpart to show- 
ing the classical sequent calculus LK complete, and vice versa. The classical 
theorems are relied on heavily in the proof of the following theorem: 

Theorem 1. Antisequent $\Gamma \nvdash \Delta$ is provable if and only if it is true.

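Example 4. Let us show that $(\forall x)(P(x) \vee Q(x)) \nvdash (\forall y)(P(y)) \vee (\forall z)(Q(z))$ for
unary relations $P$ and $Q$. The following is a partial proof, with an infinite number
of premises remaining at the top.

$$\dfrac{\dfrac{\dfrac{\dfrac{\{P(t) \vee Q(t) \nvdash P(y), Q(z) \mid t \text{ is a term of } \mathcal{L}\}}{(\forall x)(P(x) \vee Q(x)) \nvdash P(y), Q(z)}}{(\forall x)(P(x) \vee Q(x)) \nvdash P(y), (\forall z)(Q(z))}}{(\forall x)(P(x) \vee Q(x)) \nvdash (\forall y)(P(y)), (\forall z)(Q(z))}}{(\forall x)(P(x) \vee Q(x)) \nvdash (\forall y)(P(y)) \vee (\forall z)(Q(z))}$$

We are left with an infinite number of premises of the form $P(t) \vee Q(t) \nvdash P(y), Q(z)$
to prove. By rules $(\bullet\vee \nvdash)$ and $(\vee\bullet \nvdash)$, we need to be able to show
only either $P(t) \nvdash P(y), Q(z)$ or $Q(t) \nvdash P(y), Q(z)$. As long as $t \neq y$,
$P(t) \nvdash P(y), Q(z)$ is an axiom. If $t = y$, $Q(t) \nvdash P(y), Q(z)$ is an axiom. This provides
us with proofs of all of the infinitely many premises of the form $P(t) \vee Q(t) \nvdash P(y), Q(z)$
and completes the proof.

Table 2. Rules of the Antisequent Calculus

$$(\neg \nvdash)\ \frac{\Gamma \nvdash \Delta, \varphi}{\Gamma, \neg\varphi \nvdash \Delta} \qquad (\nvdash \neg)\ \frac{\Gamma, \varphi \nvdash \Delta}{\Gamma \nvdash \Delta, \neg\varphi}$$

$$(\wedge \nvdash)\ \frac{\Gamma, \varphi, \psi \nvdash \Delta}{\Gamma, \varphi \wedge \psi \nvdash \Delta} \qquad (\nvdash \bullet\wedge)\ \frac{\Gamma \nvdash \Delta, \varphi}{\Gamma \nvdash \Delta, \varphi \wedge \psi}$$

$$(\nvdash \wedge\bullet)\ \frac{\Gamma \nvdash \Delta, \psi}{\Gamma \nvdash \Delta, \varphi \wedge \psi}$$

$$(\bullet\vee \nvdash)\ \frac{\Gamma, \varphi \nvdash \Delta}{\Gamma, \varphi \vee \psi \nvdash \Delta} \qquad (\nvdash \vee)\ \frac{\Gamma \nvdash \Delta, \varphi, \psi}{\Gamma \nvdash \Delta, \varphi \vee \psi}$$

$$(\vee\bullet \nvdash)\ \frac{\Gamma, \psi \nvdash \Delta}{\Gamma, \varphi \vee \psi \nvdash \Delta}$$

$$(\bullet\rightarrow \nvdash)\ \frac{\Gamma \nvdash \Delta, \varphi}{\Gamma, \varphi \rightarrow \psi \nvdash \Delta} \qquad (\nvdash \rightarrow)\ \frac{\Gamma, \varphi \nvdash \Delta, \psi}{\Gamma \nvdash \Delta, \varphi \rightarrow \psi}$$

$$(\rightarrow\bullet \nvdash)\ \frac{\Gamma, \psi \nvdash \Delta}{\Gamma, \varphi \rightarrow \psi \nvdash \Delta}$$

$$(\forall \nvdash)\ \frac{\{\Gamma, \varphi(t) \nvdash \Delta \mid t \text{ is a term in } \mathcal{L}\}}{\Gamma, (\forall x)\varphi(x) \nvdash \Delta} \qquad (\nvdash \forall)\ \frac{\Gamma \nvdash \Delta, \varphi(x)}{\Gamma \nvdash \Delta, (\forall x)\varphi(x)}$$

$$(\exists \nvdash)\ \frac{\Gamma, \varphi(x) \nvdash \Delta}{\Gamma, (\exists x)\varphi(x) \nvdash \Delta} \qquad (\nvdash \exists)\ \frac{\{\Gamma \nvdash \Delta, \varphi(t) \mid t \text{ is a term in } \mathcal{L}\}}{\Gamma \nvdash \Delta, (\exists x)\varphi(x)}$$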
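
The propositional rules of Table 2 lend themselves to a simple backward proof search, sketched below. The sketch covers only the propositional connectives; the quantifier rules, two of which have infinitely many premises, are deliberately left out. The tuple encoding of formulas and the function name `provable` are assumptions made for the illustration.

```python
def provable(gamma, delta):
    """Decide the propositional antisequent gamma ⊬ delta by backward proof search.

    Atoms are strings; compound formulas are tuples ("not", f),
    ("and", f, g), ("or", f, g) or ("imp", f, g).
    """
    gamma, delta = frozenset(gamma), frozenset(delta)
    for f in gamma:  # decompose a compound formula on the left
        if isinstance(f, tuple):
            rest, op = gamma - {f}, f[0]
            if op == "not":
                return provable(rest, delta | {f[1]})
            if op == "and":
                return provable(rest | {f[1], f[2]}, delta)
            if op == "or":  # either of the two rules may be used
                return provable(rest | {f[1]}, delta) or provable(rest | {f[2]}, delta)
            if op == "imp":
                return provable(rest, delta | {f[1]}) or provable(rest | {f[2]}, delta)
            raise ValueError(f"unknown connective: {op}")
    for f in delta:  # decompose a compound formula on the right
        if isinstance(f, tuple):
            rest, op = delta - {f}, f[0]
            if op == "not":
                return provable(gamma | {f[1]}, rest)
            if op == "and":  # either of the two rules may be used
                return provable(gamma, rest | {f[1]}) or provable(gamma, rest | {f[2]})
            if op == "or":
                return provable(gamma, rest | {f[1], f[2]})
            if op == "imp":
                return provable(gamma | {f[1]}, rest | {f[2]})
            raise ValueError(f"unknown connective: {op}")
    # Axiom: only atoms remain, and no atom occurs on both sides.
    return gamma.isdisjoint(delta)
```

For example, `provable({("or", "p", "q")}, {"p"})` succeeds through the premise with `q` on the left, mirroring the choice between $(\bullet\vee \nvdash)$ and $(\vee\bullet \nvdash)$ made in Example 4.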



3.2 A Sequent Calculus and an Antisequent Calculus for Residues 

The sequent calculus for monotone proofs based on predicate logic will be the 
standard propositional sequent calculus LK extended by two rules for dealing 
with residues, very closely related to our one-rule sequent calculus for Horn 
programs. 

A residue sequent is a pair $(\Gamma, \Delta)$ where $\Gamma$ is a finite set of formulas of $\mathcal{L}$
and residues built from $\mathcal{L}$, and $\Delta \subseteq \mathcal{L}$ is finite;
it is usually written $\Gamma \vdash \Delta$. We say that $\Gamma \vdash \Delta$ is true if $\bigvee \Delta \in Cl(\Gamma)$.

If we extend the classical predicate sequent calculus LK (restricted to $\mathcal{L}$) by
the following two rules about residues, we obtain a sequent calculus for residues.

$$\frac{\Gamma \vdash \Delta}{\Gamma, \frac{\varphi}{\theta} \vdash \Delta} \qquad \frac{\Gamma \vdash \varphi \quad \Gamma, \theta \vdash \Delta}{\Gamma, \frac{\varphi}{\theta} \vdash \Delta}$$

This sequent calculus for residues was defined by Bonatti and Olivetti in [1],
and they proved this theorem about it:

Theorem 2. $\Gamma \vdash \Delta$ is derivable in the sequent calculus for residues if and only
if it is true.

Their proof was based on the soundness and completeness of propositional
rather than predicate logic, but the proof in either case is identical.

We can also extend antisequents to deductions from residues. A residue antisequent
will be a pair of finite sets $\Gamma$ (of formulas and residues) and $\Delta \subseteq \mathcal{L}$, written $\Gamma \nvdash \Delta$. And
just as the residue sequent $\Gamma \vdash \Delta$ was considered true if $\bigvee \Delta \in Cl(\Gamma)$, we will
consider the residue antisequent $\Gamma \nvdash \Delta$ to be true if $\bigvee \Delta \notin Cl(\Gamma)$. Just as we
extended the classical sequent calculus (limited to $\mathcal{L}$) by two rules to produce a
sound and complete residue sequent calculus, we will also extend the predicate
antisequent calculus (again limited to $\mathcal{L}$) by the two rules below to produce a
sound and complete residue antisequent calculus.

$$\frac{\Gamma \nvdash \Delta \quad \Gamma \nvdash \varphi}{\Gamma, \frac{\varphi}{\theta} \nvdash \Delta} \qquad \frac{\Gamma, \theta \nvdash \Delta}{\Gamma, \frac{\varphi}{\theta} \nvdash \Delta}$$

Bonatti and Olivetti proved the following in [1]:

Theorem 3. Antisequent $\Gamma \nvdash \Delta$ is derivable in the antisequent calculus of
residues if and only if it is true.

Again, their proof was for a residue antisequent calculus built over propositional
rather than predicate logic, but the same proof works here.



3.3 Skeptical Sequent Calculus 

One can think of Gentzen proof systems as failed exhaustive searches for countermodels.
Thus, what we will want to do as we search for a countermodel to the
claim “All extensions of default theory $(D, W)$ must contain $\varphi$” is keep track of
which formulas are in and out of our potential countermodel to the claim. We
will want to make sure that all formulas we would like to see in our potential
countermodel do, in fact, have proofs; and we want also to make sure that all
formulas we plan to exclude do not have proofs. Finally, we will use our increasing
information about the potential countermodel to determine which defaults
will be dismissed as irrelevant and which will be retained as residues.

The sequents for skeptical reasoning for default logic will be triples $(\Sigma, \Gamma, \Delta)$,
usually notated $\Sigma; \Gamma \mathrel{|\!\sim} \Delta$. This notation is drawn directly from [1]. The sets $\Gamma$
and $\Delta$ are relatively straightforward. $\Gamma$ is a closed predicate default theory,
and $\Delta$ is a set of formulas of $\mathcal{L}$, neither necessarily finite. The set $\Sigma$ is more
complicated. In Bonatti and Olivetti’s formulation, it was a finite collection of
provability constraints of the form $L\varphi$ or $\neg L\varphi$ where $\varphi \in \mathcal{L}$. The intention was
to suggest the modal operator $L$, and indicate whether $\varphi$ was in or out of the
potential countermodel, as it exists so far. We will need to extend the notion of a
provability constraint to be more explicit than “$\varphi$ can be proved.” We will need
to be able to say “$\varphi$ can be proved from these specific rules.” Thus, in addition
to provability constraints of the form $L\varphi$ and $\neg L\varphi$ (which we will call implicit
provability constraints), we will also include explicit provability constraints $L_\Gamma \varphi$
where $\Gamma$ is a finite set of formulas from $\mathcal{L}$. The intended meaning of $L_\Gamma \varphi$ is
“$\Gamma \vdash \varphi$.” Together, explicit and implicit provability constraints will be known as
general provability constraints. We can now finally describe $\Sigma$ as a set of general
provability constraints.

We will say that a theory $T \subseteq \mathcal{L}$ satisfies $L\varphi$ if $\varphi \in T$ and satisfies $\neg L\varphi$ if
$\varphi \notin T$. We will say that $T$ satisfies $L_\Gamma \varphi$ if $\varphi \in T$ and in addition $\Gamma \vdash \varphi$.

The reader may wonder why we do not need explicit provability constraints
of the form $\neg L_\Gamma \varphi$. We will be making claims both of the form “$\varphi$ has a proof”
and of the form “$\varphi$ has no proof”. To contradict a claim of the form “$\varphi$ has no
proof,” it is necessary simply to exhibit a single proof of $\varphi$. On the other hand,
to contradict a claim of the form “$\varphi$ has a proof,” we must look at all possible
proofs of $\varphi$, that is all (relevant) explicit proof constraints.

We will say that $\Sigma; \Gamma \mathrel{|\!\sim} \Delta$ is true if every extension $S$ of $\Gamma$ which
satisfies all constraints in $\Sigma$ includes at least one member of $\Delta$. Thus, $\varphi$ is a
skeptical consequence of the default theory $(D, W)$ if $\emptyset; \text{ground}((D, W)) \mathrel{|\!\sim} \varphi$ is a
true sequent.

This calculus incorporates three sorts of sequents: residue sequents, residue 
antisequents, and the skeptical reasoning sequents just defined. The sequent 
calculus for skeptical reasoning will include all axioms and rules of the residue 
sequent and antisequent calculi, plus five new rules. (No additional axioms will 
be necessary. The leaves of every proof tree will be classical predicate sequent 
and antisequent axioms.) 

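Definition 5 (Skeptical Sequent Calculus for Default Logic).

The axioms of the skeptical sequent calculus are classical predicate sequents $\Gamma \vdash \Delta$
with $\Gamma \cap \Delta \neq \emptyset$; and predicate antisequents $\Gamma \nvdash \Delta$ with $\Gamma \cup \Delta$ all atomic
formulas and $\Gamma \cap \Delta = \emptyset$. The rules are:

0. The rules of LK and the antisequent rules from Table 2, all limited to sequents
and antisequents in $\mathcal{L}$; plus the two pairs of additional rules for residue
sequents and antisequents.

1. $\dfrac{\Gamma' \vdash \Delta'}{\Sigma; \Gamma \mathrel{|\!\sim} \Delta}$ where $\Gamma'$ is a finite set of formulas and residues,
$\Gamma' \subseteq (\Gamma \cup \{\varphi \mid L\varphi \in \Sigma\} \cup \{\varphi \mid L_{\Gamma''}\varphi \in \Sigma \text{ for some } \Gamma''\})$, and $\Delta' \subseteq \Delta$ is finite.

2. $\dfrac{\Gamma' \vdash \varphi}{\neg L\varphi, \Sigma; \Gamma \mathrel{|\!\sim} \Delta}$ where $\Gamma'$ is a finite set of formulas and residues and
$\Gamma' \subseteq (\Gamma \cup \{\psi \mid L\psi \in \Sigma\} \cup \{\psi \mid L_{\Gamma''}\psi \in \Sigma \text{ for some } \Gamma''\})$.

3. $\dfrac{\Gamma_0 \nvdash \varphi}{L_{\Gamma_0}\varphi, \Sigma; \Gamma \mathrel{|\!\sim} \Delta}$ where $\Gamma_0$ is a finite set of formulas and residues.

4. $\dfrac{\{L_{\Gamma'_0}\varphi, \Sigma, \Sigma'_0; \Gamma'_0, (\Gamma \setminus \Gamma') \mathrel{|\!\sim} \Delta \mid \Gamma' \subseteq \Gamma \text{ is finite}\}}{L\varphi, \Sigma; \Gamma \mathrel{|\!\sim} \Delta}$ where $\Gamma'_0$ is obtained from $\Gamma'$ by
replacing each default $\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta} \in \Gamma'$ with its residue $\frac{\varphi}{\theta}$, and
$\Sigma'_0 = \{\neg L\neg\psi \mid \psi = \psi_j$ for some $\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta} \in \Gamma'$ and some $1 \leq j \leq n\}$.

5. $\dfrac{\neg L\neg\psi_1, \ldots, \neg L\neg\psi_n, \Sigma; \Gamma, \frac{\varphi}{\theta} \mathrel{|\!\sim} \Delta \qquad L\neg\psi_1, \Sigma; \Gamma \mathrel{|\!\sim} \Delta \quad \cdots \quad L\neg\psi_n, \Sigma; \Gamma \mathrel{|\!\sim} \Delta}{\Sigma; \Gamma, \frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta} \mathrel{|\!\sim} \Delta}$

The reader may recall from the introduction that we noted that infinitary
rules of inference would be necessary in some cases. When applied to a sequent
with infinite $\Gamma$, rule 4 has infinitely many premises.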

Let us examine what each of these rules accomplishes, thinking of ourselves as 
traversing a completed proof backwards, from conclusions to premises, examining 
each branch as a failed attempt to find a counterexample to the assertion made 
at the root of the tree. Just as we are moving backwards through the proof, let 
us also move backwards through the rules. 



Rule 5 says: “Either default $\frac{\varphi : M\psi_1, \ldots, M\psi_n}{\theta}$ is relevant or it is not. If it
is relevant, make sure that the context reflects that, and put the residue $\frac{\varphi}{\theta}$ into
our list of usable rules. If it is not relevant, it must be because $\neg\psi_j$ is in the
context for some $\psi_j$.”

Rule 4 says: “If we are asserting that $\varphi$ is in the extension we are trying
to build, it must have a proof from the available clauses. Because in different
extensions, it might have different proofs, we will need to examine each possible
proof independently. If defaults from $\Gamma$ are used, update the context to reflect
their usability and replace them in $\Gamma$ with their residues.”

Rule 3 says: “If we have said that $\Gamma_0 \vdash \varphi$ and yet we can show that $\Gamma_0 \nvdash \varphi$,
show this and stop.”

Rule 2 says: “If we have said that $\varphi$ has no proof, and yet from what we
already know about the potential extension we are building we can show that
that extension must contain $\varphi$, show this and stop.”

Rule 1 says: “If from what we already know about the potential extension
we are building, we can show that a member of $\Delta$ must be in that extension,
show this and stop.”



Example 5. We stated in Example 3 that $\exists x A(x)$ is in the set of skeptical consequences
of default theory $(D, W)$ from Example 1. Let us now show how our
skeptical sequent calculus would prove this. The sequent we want to prove, then,
is:

$$\emptyset;\ W, \frac{: MA(0)}{A(0)}, \frac{: MA(S0)}{A(S0)}, \frac{: MA(SS0)}{A(SS0)}, \ldots \mathrel{|\!\sim} \exists x A(x).$$

We will prove this sequent by rule 5. The two premises we will need are:

$$\neg L\neg A(0);\ W, \frac{\ }{A(0)}, \frac{: MA(S0)}{A(S0)}, \frac{: MA(SS0)}{A(SS0)}, \ldots \mathrel{|\!\sim} \exists x A(x)$$

and

$$L\neg A(0);\ W, \frac{: MA(S0)}{A(S0)}, \frac{: MA(SS0)}{A(SS0)}, \ldots \mathrel{|\!\sim} \exists x A(x).$$

These sequents will have very different proofs. We can prove the former by
means of rule 2, since $\frac{\ }{A(0)}$ is among the residue clauses in the $\Gamma$ of the sequent,
and a residue proof of $\exists x A(x)$ from $\frac{\ }{A(0)}$ is trivial. The latter we will prove by
means of rule 4, which will have infinitely many premises, each of the form

$$L_{\Gamma'_0}\neg A(0), \Sigma'_0;\ \Gamma'_0, (\Gamma \setminus \Gamma') \mathrel{|\!\sim} \exists x A(x)$$

where

$$\Gamma = W \cup \left\{ \frac{: MA(S0)}{A(S0)}, \frac{: MA(SS0)}{A(SS0)}, \frac{: MA(SSS0)}{A(SSS0)}, \ldots \right\}.$$

We will look at a few representative premises. In each case, writing out the
sequent with $\Gamma'_0$ and the other terms expanded to fit the particular case would
be unwieldy, so we will simply use the template above and refer to $\Gamma'_0$ and so
forth by name.

- In the case that $\Gamma'$ consists of $\frac{: MA(SS0)}{A(SS0)}$ and $\forall x, y (A(x) \wedge A(y) \rightarrow x = y)$
plus enough of $W$ to prove $SS0 \neq 0$, $\Gamma'_0$ will consist of $\frac{\ }{A(SS0)}$ and
$\forall x, y (A(x) \wedge A(y) \rightarrow x = y)$ together with the rest of the portions of $\Gamma'$
drawn from $W$. $\Sigma'_0$ will consist of $\neg L\neg A(SS0)$.
In this case, the premise can be proved using rule 1, since $\frac{\ }{A(SS0)} \in \Gamma'_0$ and
$\frac{\ }{A(SS0)} \vdash \exists x A(x)$.

- In the case that $\Gamma'$ consists of $\frac{: MA(SS0)}{A(SS0)}$, $\forall x, y (A(x) \wedge A(y) \rightarrow x = y)$,
and $\frac{: MA(SSS0)}{A(SSS0)}$ plus enough of $W$ to prove $SS0 \neq 0$, we could proceed
just as in the above case. We do have another option open to us in this case,
though, which will illustrate the use of rule 2. $\Gamma'_0$ will include both $\frac{\ }{A(SS0)}$
and $\frac{\ }{A(SSS0)}$, while $\Sigma'_0$ will consist of both $\neg L\neg A(SS0)$ and $\neg L\neg A(SSS0)$.
Using $\frac{\ }{A(SS0)}$ from $\Gamma'_0$ and the relevant portions of $W$ from $\Gamma \setminus \Gamma'$, we
can show that

$$\forall x, y (A(x) \wedge A(y) \rightarrow x = y),\ SS0 \neq SSS0,\ \frac{\ }{A(SS0)} \vdash \neg A(SSS0)$$

and use rule 2 to prove the desired sequent.

- In the case that $\Gamma'$ is drawn entirely from $W$, it is not hard to show that
$\Gamma'_0 = \Gamma'$, and we can show $\Gamma'_0 \nvdash \neg A(0)$. We can then use rule 3 to prove our
sequent.

We have seen that we can derive at least some of the sequents which are rule 
4 premises using rules 1, 2, and 3. In fact, all of the infinite set of premises of 
our particular application of rule 4 can be proved using these three rules. One 
application of rule 4 gets us one of our two premises of our desired case of rule 
5, and the other was proved directly by rule 2. With one application of rule 5, 
our deduction is complete.

We conclude with the expected soundness and completeness theorem.

Theorem 4. A default logic skeptical reasoning sequent $\Sigma; \Gamma \mathrel{|\!\sim} \Delta$ is true if and
only if it is provable.

Proof. Due to limitations of space, only the barest sketch of a proof may be
presented here. Proofs of the soundness of each of the five rules are straightforward
and independent of each other. To prove the adequacy of the rules listed
to generate all true sequents, we show that any sequent which has no deduction
is false. If we let $\Sigma; \Gamma \mathrel{|\!\sim} \Delta$ be a non-deducible sequent, we can build a failed
attempt at a deduction which is guaranteed to have at least one branch not
terminating in an axiom. Because of the nature of the five rules, only rules 4 and
5 will be used along this non-terminating branch. The context developed along
this branch expanding by those two rules will be a witness to the falsehood of
$\Sigma; \Gamma \mathrel{|\!\sim} \Delta$.

We have now seen a sequent calculus for skeptical reasoning in default logic;
versions exist for stable model logic programming and autoepistemic logic. One
obvious direction in which to extend this work would be to do the same for predicate
circumscription, which would require a $\Pi^1_2$ sequent calculus. One important
feature of the calculus presented above is that the assertion that $\varphi$ simply has
a proof is not enough. We must look at all possible proofs of $\varphi$. This necessity
to take an assertion of the existence of a proof and explicate it with an actual
proof strongly suggests a connection with Artemov’s logic of proofs (see [11]).

Abstract. In previous work, I have presented approaches to nonmonotonic probabilistic
reasoning, which is a probabilistic generalization of default reasoning from
conditional knowledge bases. In this paper, I continue this exciting line of research.
I present a new probabilistic generalization of Lehmann’s lexicographic entailment,
called $lex_\lambda$-entailment, which is parameterized through a value $\lambda \in [0, 1]$
that describes the strength of the inheritance of purely probabilistic knowledge.
Roughly, the new notion of entailment is obtained from logical entailment in
model-theoretic probabilistic logic by adding (i) the inheritance of purely probabilistic
knowledge of strength $\lambda$, and (ii) a mechanism for resolving inconsistencies
due to the inheritance of logical and purely probabilistic knowledge. I also explore
the semantic properties of $lex_\lambda$-entailment.



1 Introduction 

During the recent decades, there has been a significant amount of research in AI that 
concentrates on probabilistic reasoning with interval restrictions for conditional prob- 
abilities, also called conditional constraints [26]. The main focus of this research was 
especially on the computational aspects of probabilistic reasoning in model-theoretic 
probabilistic logic, which is a major approach for handling conditional constraints that 
can be traced back to Boole [8]. A wide spectrum of formal languages has been explored 
in model-theoretic probabilistic logic, ranging from constraints for unconditional and 
conditional events (e.g., [1,14,25,26,28,32]) to linear inequalities over events [12]. Prob- 
abilistic reasoning in model-theoretic probabilistic logic, however, is not the only way 
of handling conditional constraints. An alternative approach to probabilistic reasoning 
with conditional constraints is based on the coherence principle of de Finetti (e.g., [5, 
16,17]) and has been extensively explored especially in the field of statistics. 

Example 1.1. Suppose we have the knowledge “ostriches are birds”, “birds have legs”, 
“birds fly with a probability of at least 0.95”, and “ostriches fly with a probability of at 
most 0.05”. In model-theoretic probabilistic logic, we then conclude that both birds and 
ostriches have legs, and that birds (resp., ostriches) fly with a probability of at least 0.95 
(resp., at most 0.05). In coherence-based probabilistic logic, in contrast, we conclude that 
birds (resp., ostriches) have (resp., do not have) legs, and that they fly with a probability 
of at least 0.95 (resp., at most 0.05). □ 

* Alternate address: Institut für Informationssysteme, Technische Universität Wien,
Favoritenstraße 9-11, A-1040 Vienna, Austria; e-mail: lukasiewicz@kr.tuwien.ac.at.

T.D. Nielsen and N.L. Zhang (Eds.): ECSQARU 2003, LNAI 2711, pp. 576-587, 2003.

The relationship between model-theoretic and coherence-based probabilistic logic
has recently been explored in [7]. In particular, it turned out that model-theoretic
entailment is strictly stronger than entailment under coherence, while satisfiability in
model-theoretic probabilistic logic is strictly weaker than consistency in probabilistic logic
under coherence. Furthermore, model-theoretic probabilistic entailment is well-known 
to be a generalization of model-theoretic entailment in classical propositional logics, 
while probabilistic entailment under coherence is a generalization of classical default 
entailment from conditional knowledge bases in System P. 

Hence, it is natural to wonder whether there are probabilistic generalizations of other 
formalisms for default reasoning from conditional knowledge bases. 

The literature contains several different proposals for default reasoning from condi- 
tional knowledge bases and extensive work on its desired properties. The core of these 
properties are the rationality postulates of System P proposed by Kraus et al. [19]. It 
turned out that these rationality postulates constitute a sound and complete axiom system 
for several classical model-theoretic entailment relations under uncertainty measures on 
worlds. In detail, they characterize classical model-theoretic entailment under prefer- 
ential structures, infinitesimal probabilities, possibility measures, and world rankings. 
They also characterize an entailment relation based on conditional objects. A survey of 
the above relationships is given in [4]. 

Mainly to solve problems with irrelevant information, rational closure as a more ad- 
venturous entailment relation was proposed by Lehmann [23]. It is equivalent to entail- 
ment in System Z by Pearl [33], to the least specific possibility entailment by Benferhat 
et al. [3], and to a conditional (modal) logic-based entailment by Lamarre [22]. Finally, 
mainly to solve problems with property inheritance from classes to exceptional sub- 
classes, further formalisms were proposed, in particular, lexicographic entailment by 
Lehmann [24] and Benferhat et al. [2] and conditional entailment by Geffner [15]. 

Indeed, such formalisms for default reasoning from conditional knowledge bases 
can be generalized to the probabilistic framework of conditional constraints [29,30] (see 
Section 5 for more details on these formalisms and some of their applications): 

• In [29], I introduce probabilistic generalizations of Pearl’s entailment in System Z 
and Lehmann’s lexicographic entailment, which lie between model-theoretic and 
coherence-based probabilistic entailment. Roughly, the main difference between 
model-theoretic and coherence-based probabilistic entailment is that the former 
realizes an inheritance of logical knowledge, while the latter does not. Intuitively, the 
new formalisms now add a strategy for resolving inconsistencies to model-theoretic 
entailment, and a restricted form of inheritance of logical knowledge to entailment 
under coherence. This is why they are weaker than model-theoretic probabilistic 
entailment and stronger than coherence-based probabilistic entailment. 

• In [30], I introduce similar probabilistic generalizations of Pearl’s entailment in Sys- 
tem Z, Lehmann’s lexicographic entailment, and Geffner’s conditional entailment. 
They, however, behave quite differently from the ones in [29]. Roughly, model- 
theoretic probabilistic entailment realizes an inheritance of logical knowledge, but 
no inheritance of purely probabilistic knowledge. The formalisms in [30] now add 
an inheritance of purely probabilistic knowledge and a strategy for resolving
inconsistencies (due to the inheritance of logical and purely probabilistic knowledge)
to entailment in model-theoretic probabilistic logic. This is why they are generally
much stronger than entailment in model-theoretic probabilistic logic.

In the present paper, I define a general approach to nonmonotonic probabilistic rea- 
soning, which subsumes the above two approaches [29] and [30] as special cases, and 
which also allows for nonmonotonic probabilistic reasoning between them. Roughly, the 
main idea behind this new approach is to add to model-theoretic probabilistic entailment 
(i) some inheritance of purely probabilistic knowledge that is controlled by a strength
$\lambda \in [0, 1]$, and (ii) a mechanism for resolving inconsistencies due to the inheritance of
logical and purely probabilistic knowledge. Based on this idea, I define a new probabilistic
generalization of Lehmann’s lexicographic entailment. Other formalisms for default
reasoning from conditional knowledge bases can be extended in much the same
way (such an extension of Pearl’s entailment in System Z is included in [31]). The main
contributions of this paper can be summarized as follows: 

• I present a new probabilistic generalization of Lehmann’s lexicographic entailment,
which is parameterized through a value $\lambda \in [0, 1]$ that describes the strength of the
inheritance of purely probabilistic knowledge. For $\lambda = 0$ (resp., $\lambda = 1$), it coincides
with probabilistic lexicographic entailment introduced in [29] (resp., [30]).

• I show that probabilistic lexicographic entailment of strength $\lambda$ has properties
similar to those of its classical counterpart. In particular, it satisfies the rationality
postulates of System P and the property of Rational Monotonicity.

• I also show that probabilistic lexicographic entailment of strength $\lambda$ is a proper
generalization of its classical counterpart. Furthermore, it is weaker than some notion
of logical entailment in model-theoretic probabilistic logic, and under certain
conditions it coincides with this notion of entailment.

Note that detailed proofs of all results are given in [31]. 

2 Preliminaries 

In this section, I define probabilistic knowledge bases. I then recall the notions of satis- 
fiability and logical entailment from model-theoretic probabilistic logic, and the notions 
of g-coherence and g-coherent entailment from probabilistic logic under coherence. 

2.1 Probabilistic Knowledge Bases 

I assume a set of basic events $\Phi = \{p_1, \ldots, p_n\}$ with $n \geq 1$. I use $\bot$ and $\top$ to denote
false and true, respectively. I define events by induction as follows. Every element of
$\Phi \cup \{\bot, \top\}$ is an event. If $\phi$ and $\psi$ are events, then also $\neg\phi$ and $(\phi \wedge \psi)$.
A conditional event is an expression of the form $\psi|\phi$ with events $\psi$ and $\phi$. A conditional
constraint is an expression $(\psi|\phi)[l, u]$ with events $\psi$, $\phi$, and real numbers $l, u \in [0, 1]$.
I define probabilistic formulas by induction as follows. Every conditional constraint is a
probabilistic formula. If $F$ and $G$ are probabilistic formulas, then also $\neg F$ and $(F \wedge G)$.
I use $(F \vee G)$ and $(F \Leftarrow G)$ to abbreviate $\neg(\neg F \wedge \neg G)$ and $\neg(\neg F \wedge G)$, respectively,
where $F$ and $G$ are either two events or two probabilistic formulas, and adopt the usual
conventions to eliminate parentheses. A logical constraint is an event of the form
$\psi \Leftarrow \phi$. A probabilistic knowledge base $KB = (L, P)$ consists of a finite set of logical
constraints $L$ and a finite set of conditional constraints $P$.

Example 2.1. The knowledge “eagles are birds”, “birds have legs”, and “birds fly with
a probability of at least 0.95” can be expressed by the probabilistic knowledge base
$KB = (L, P) = (\{\mathit{bird} \Leftarrow \mathit{eagle}\}, \{(\mathit{legs}|\mathit{bird})[1, 1], (\mathit{fly}|\mathit{bird})[0.95, 1]\})$. Note that in
model-theoretic probabilistic logic, $\psi \Leftarrow \phi \in L$ means the same as $(\psi|\phi)[1, 1] \in P$,
whereas in probabilistic logic under coherence and in probabilistic lexicographic entailment,
$L$ is strict, while $(\psi|\phi)[1, 1] \in P$ may have exceptions. □

Example 2.2. The knowledge “ostriches are birds”, “birds have wings with a probability
between 0.65 and 0.75”, “birds fly with a probability of at least 0.95”, and “ostriches
fly with a probability of at most 0.05” can be expressed by the probabilistic knowledge
base $KB = (L, P)$, where $L = \{\mathit{bird} \Leftarrow \mathit{ostrich}\}$ and $P = \{(\mathit{wings}|\mathit{bird})[0.65, 0.75],
(\mathit{fly}|\mathit{bird})[0.95, 1], (\mathit{fly}|\mathit{ostrich})[0, 0.05]\}$. □

A world $I$ is a truth assignment to the basic events in $\Phi$ (that is, a mapping
$I\colon \Phi \to \{\mathbf{true}, \mathbf{false}\}$), which is inductively extended to all events by $I(\bot) = \mathbf{false}$,
$I(\top) = \mathbf{true}$, $I(\neg\phi) = \mathbf{true}$ iff $I(\phi) = \mathbf{false}$, and $I((\phi \wedge \psi)) = \mathbf{true}$ iff
$I(\phi) = I(\psi) = \mathbf{true}$. I use $\mathcal{I}_\Phi$ to denote the set of all worlds for $\Phi$. A world $I$ satisfies
an event $\phi$, or $I$ is a model of $\phi$, denoted $I \models \phi$, iff $I(\phi) = \mathbf{true}$. I extend worlds $I$ to
conditional events $\psi|\phi$ by $I(\psi|\phi) = \mathbf{true}$ iff $I \models \psi \wedge \phi$, $I(\psi|\phi) = \mathbf{false}$ iff
$I \models \neg\psi \wedge \phi$, and $I(\psi|\phi) = \mathbf{indeterminate}$ iff $I \models \neg\phi$. A probabilistic interpretation $Pr$
is a probability function on $\mathcal{I}_\Phi$ (that is, a mapping $Pr\colon \mathcal{I}_\Phi \to [0, 1]$ such that all $Pr(I)$
with $I \in \mathcal{I}_\Phi$ sum up to 1).

The probability of an event $\phi$ in $Pr$, denoted $Pr(\phi)$, is the sum of all $Pr(I)$ such that
$I \in \mathcal{I}_\Phi$ and $I \models \phi$. For events $\phi$ and $\psi$ with $Pr(\phi) > 0$, I write $Pr(\psi|\phi)$ to abbreviate
$Pr(\psi \wedge \phi) / Pr(\phi)$. The truth of logical constraints and probabilistic formulas $F$ in a
probabilistic interpretation $Pr$, denoted $Pr \models F$, is defined as follows:

• $Pr \models \psi \Leftarrow \phi$ iff $Pr(\psi \wedge \phi) = Pr(\phi)$;

• $Pr \models (\psi|\phi)[l, u]$ iff $Pr(\phi) = 0$ or $Pr(\psi|\phi) \in [l, u]$;

• $Pr \models \neg F$ iff not $Pr \models F$;

• $Pr \models (F \wedge G)$ iff $Pr \models F$ and $Pr \models G$.

I say $Pr$ satisfies $F$, or $Pr$ is a model of $F$, iff $Pr \models F$. Moreover, $Pr$ satisfies a set of
logical constraints and probabilistic formulas $\mathcal{F}$, or $Pr$ is a model of $\mathcal{F}$, denoted $Pr \models \mathcal{F}$,
iff $Pr$ is a model of all $F \in \mathcal{F}$.
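
The definitions of this subsection can be made concrete with a small sketch. The encoding below is an assumption made purely for illustration and is not part of the formalism: worlds are Python dictionaries, events are predicates on worlds, and a probabilistic interpretation is a list of (world, mass) pairs with exact rational masses. The interpretation `pr` is a hypothetical one, chosen so that it is a model of the knowledge base of Example 2.1.

```python
from fractions import Fraction
from itertools import product


def worlds(basic_events):
    """Enumerate every world, i.e. every truth assignment to the basic events."""
    for values in product((False, True), repeat=len(basic_events)):
        yield dict(zip(basic_events, values))


def prob(pr, event):
    """Pr(event): sum of Pr(I) over all worlds I that satisfy the event."""
    return sum((mass for world, mass in pr if event(world)), Fraction(0))


def satisfies_logical(pr, psi, phi):
    """Pr |= psi <= phi  iff  Pr(psi and phi) = Pr(phi)."""
    return prob(pr, lambda w: psi(w) and phi(w)) == prob(pr, phi)


def satisfies_conditional(pr, psi, phi, lower, upper):
    """Pr |= (psi|phi)[lower, upper]  iff  Pr(phi) = 0 or Pr(psi|phi) is in the interval."""
    p_phi = prob(pr, phi)
    if p_phi == 0:
        return True
    return lower <= prob(pr, lambda w: psi(w) and phi(w)) / p_phi <= upper


# Events as predicates on worlds (an encoding chosen for this sketch).
bird = lambda w: w["bird"]
eagle = lambda w: w["eagle"]
legs = lambda w: w["legs"]
fly = lambda w: w["fly"]

# Hypothetical interpretation: 95% flying eagles, 5% non-flying birds that are not eagles.
pr = [
    ({"bird": True, "eagle": True, "legs": True, "fly": True}, Fraction(19, 20)),
    ({"bird": True, "eagle": False, "legs": True, "fly": False}, Fraction(1, 20)),
]

# Check L = {bird <= eagle} and P = {(legs|bird)[1,1], (fly|bird)[0.95,1]}.
is_model = (
    satisfies_logical(pr, bird, eagle)
    and satisfies_conditional(pr, legs, bird, 1, 1)
    and satisfies_conditional(pr, fly, bird, Fraction(19, 20), 1)
)
print(is_model)
```

Exact fractions are used so that the boundary case $Pr(\mathit{fly}|\mathit{bird}) = 0.95$ is not affected by floating-point rounding.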

2.2 Model-Theoretic Probabilistic Logic 

I now recall the model-theoretic notions of satisfiability and logical entailment. 

A set of logical constraints and probabilistic formulas $\mathcal{F}$ is satisfiable iff a model
of $\mathcal{F}$ exists. A conditional constraint $(\psi|\phi)[l, u]$ is a logical consequence of $\mathcal{F}$,
denoted $\mathcal{F} \|\!= (\psi|\phi)[l, u]$, iff each model of $\mathcal{F}$ is also a model of $(\psi|\phi)[l, u]$. It is a tight
logical consequence of $\mathcal{F}$, denoted $\mathcal{F} \|\!=_{\mathit{tight}} (\psi|\phi)[l, u]$, iff $l = \inf Pr(\psi|\phi)$ (resp.,
$u = \sup Pr(\psi|\phi)$) subject to all models $Pr$ of $\mathcal{F}$ with $Pr(\phi) > 0$. Here, I define $l = 1$ and
$u = 0$, when $\mathcal{F} \|\!= (\phi|\top)[0, 0]$. A probabilistic knowledge base $KB = (L, P)$ is satisfiable
iff $L \cup P$ is satisfiable. A conditional constraint $(\psi|\phi)[l, u]$ is a logical consequence
of $KB$, denoted $KB \|\!= (\psi|\phi)[l, u]$, iff $L \cup P \|\!= (\psi|\phi)[l, u]$. It is a tight logical
consequence of $KB$, denoted $KB \|\!=_{\mathit{tight}} (\psi|\phi)[l, u]$, iff $L \cup P \|\!=_{\mathit{tight}} (\psi|\phi)[l, u]$.

Table 1. Tight intervals under logical and g-coherent entailment from KB in Example 2.1

Conditional Event   $\|\!=_{\mathit{tight}}$   $\|\!\sim^{g}_{\mathit{tight}}$   Conditional Event   $\|\!=_{\mathit{tight}}$   $\|\!\sim^{g}_{\mathit{tight}}$
legs|bird           [1,1]      [1,1]      fly|bird            [0.95,1]   [0.95,1]
legs|eagle          [1,1]      [0,1]      fly|eagle           [0,1]      [0,1]


Example 2.3. Let $KB = (L, P)$ be as in Example 2.1. In model-theoretic probabilistic
logic, KB represents the logical knowledge “all eagles are birds” and “all birds have
legs”, and the probabilistic knowledge “birds fly with a probability of at least 0.95”. It
is not difficult to see that KB is satisfiable. Some tight logical consequences of KB are
shown in Table 1, left sides. For example, $(\mathit{fly}|\mathit{eagle})[0, 1]$ is a tight logical consequence
of KB. Observe that the logical property of having legs is inherited from birds down to 
the subclass of eagles, while the purely probabilistic property of being able to fly with 
a probability of at least 0.95 is not inherited. □ 
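
The endpoints of the interval for fly|eagle can be witnessed by two hand-built models of KB. The sketch below is an illustration under assumptions: worlds are encoded as sets of the basic events that hold in them, events are restricted to conjunctions of basic events, and the two interpretations `low` and `high` are hypothetical. Computing tight intervals for arbitrary knowledge bases would require an optimization procedure, which this sketch does not attempt.

```python
from fractions import Fraction


def prob(pr, atoms):
    """Probability that all the given basic events hold together."""
    return sum((m for world, m in pr.items() if atoms <= world), Fraction(0))


def cond(pr, psi, phi):
    """Pr(psi|phi) for conjunctions of basic events; requires Pr(phi) > 0."""
    return prob(pr, psi | phi) / prob(pr, phi)


def is_model_of_kb(pr):
    """Check L and P of Example 2.1 for an interpretation with Pr(bird) > 0."""
    eagles_are_birds = all(
        "bird" in world for world, m in pr.items() if m > 0 and "eagle" in world
    )
    return (
        eagles_are_birds
        and cond(pr, {"legs"}, {"bird"}) == 1
        and cond(pr, {"fly"}, {"bird"}) >= Fraction(19, 20)
    )


# All eagles are among the 5% of birds that do not fly.
low = {
    frozenset({"bird", "legs", "fly"}): Fraction(19, 20),
    frozenset({"bird", "eagle", "legs"}): Fraction(1, 20),
}
# Every bird is a flying eagle.
high = {frozenset({"bird", "eagle", "legs", "fly"}): Fraction(1)}

for name, pr in (("low", low), ("high", high)):
    print(name, is_model_of_kb(pr), cond(pr, {"fly"}, {"eagle"}), cond(pr, {"legs"}, {"eagle"}))
```

In both models eagles have legs with probability 1, while the probability that eagles fly ranges from 0 to 1, in line with the left sides of Table 1.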



2.3 Probabilistic Logic under Coherence 

I now recall the notions of g-coherence and g-coherent entailment. I define them by 
using some characterizations through concepts from default reasoning [7]. 

A probabilistic interpretation $Pr$ verifies a conditional constraint $(\psi|\phi)[l, u]$ iff
$Pr(\phi) > 0$ and $Pr \models (\psi|\phi)[l, u]$. A set of conditional constraints $P$ is under a set of
logical constraints $L$ in conflict with $(\psi|\phi)[l, u]$ iff no model of $L \cup P$ verifies $(\psi|\phi)[l, u]$.
A conditional constraint ranking $\sigma$ on a probabilistic knowledge base $KB = (L, P)$ maps
each element of $P$ to a nonnegative integer. It is admissible with $KB$ iff every $P' \subseteq P$
that is under $L$ in conflict with some $C \in P$ contains a conditional constraint $C'$ such
that $\sigma(C') < \sigma(C)$. A probabilistic knowledge base $KB$ is g-coherent iff there exists a
conditional constraint ranking on $KB$ that is admissible with $KB$.

Let $KB = (L, P)$ be a g-coherent probabilistic knowledge base, and let $(\psi|\phi)[l, u]$ be
a conditional constraint. Then, $(\psi|\phi)[l, u]$ is a g-coherent consequence of $KB$, denoted
$KB \|\!\sim^{g} (\psi|\phi)[l, u]$, iff $(L, P \cup \{(\psi|\phi)[p, p]\})$ is not g-coherent for all $p \in [0, l) \cup (u, 1]$.
It is a tight g-coherent consequence of $KB$, denoted $KB \|\!\sim^{g}_{\mathit{tight}} (\psi|\phi)[l, u]$, iff $l = \inf p$
(resp., $u = \sup p$) subject to all g-coherent $(L, P \cup \{(\psi|\phi)[p, p]\})$.

Example 2.4. Let $KB = (L, P)$ be as in Example 2.1. In probabilistic logic under
coherence, KB represents the logical knowledge “all eagles are birds”, the default logical
knowledge “generally, birds have legs”, and the default probabilistic knowledge “gen- 
erally, birds fly with a probability of at least 0.95”. It is not difficult to see that KB 
is g-coherent. Some tight g-coherent consequences of KB are shown in Table 1, right 
sides. Observe that under g-coherent entailment, neither the logical property of having 
legs nor the purely probabilistic one of being able to fly with a probability of at least 
0.95 is inherited from the class of birds down to the subclass of eagles. □ 

3 Probabilistic Lexicographic Entailment of Strength $\lambda$

I now introduce a new probabilistic generalization of Lehmann’s lexicographic entailment,
called $\mathit{lex}_\lambda$-entailment, which is parameterized through a value $\lambda \in [0, 1]$ that
describes the strength of the inheritance of purely probabilistic knowledge. I first describe
the main ideas behind the new formalism, I then define the concept of $\lambda$-consistency for
probabilistic knowledge bases, and I finally define the notion of $\mathit{lex}_\lambda$-entailment.

3.1 Key Ideas 

The inheritance of logical knowledge along subclass relationships is the following
property (for all events $\psi$, $\phi$, $\phi^\star$, probabilistic knowledge bases $KB$, and $c \in \{0, 1\}$):

L-INH. If $KB \|\!\sim (\psi|\phi)[c, c]$ and $\phi \Leftarrow \phi^\star$ is valid, then $KB \|\!\sim (\psi|\phi^\star)[c, c]$.

The inheritance of purely probabilistic knowledge along subclass relationships is defined
as follows (for all events $\psi$, $\phi$, $\phi^\star$, probabilistic knowledge bases $KB$, and intervals
$[l, u] \subseteq [0, 1]$ different from $[0, 0]$, $[1, 1]$, and $[1, 0]$):

P-INH. If $KB \|\!\sim (\psi|\phi)[l, u]$ and $\phi \Leftarrow \phi^\star$ is valid, then $KB \|\!\sim (\psi|\phi^\star)[l, u]$.

It is not difficult to verify that logical entailment satisfies (L-INH), but does not
satisfy (P-INH), while g-coherent entailment satisfies neither (L-INH) nor (P-INH).

The basic idea behind the new probabilistic generalization of Lehmann’s lexicographic
entailment in this paper is that it adds to the notion of logical (resp., g-coherent)
entailment (i) some inheritance of purely probabilistic (resp., logical and purely
probabilistic) knowledge, where the inheritance of purely probabilistic knowledge depends
on a strength $\lambda \in [0, 1]$, and (ii) a mechanism for resolving inconsistencies due to the
inheritance of logical and purely probabilistic knowledge.

The strength $\lambda \in [0, 1]$ determines to which extent purely probabilistic knowledge
is inherited from classes down to subclasses. In the extreme cases of $\lambda = 0$ and $\lambda = 1$,
purely probabilistic knowledge is not inherited at all [29] and completely inherited [30],
respectively, while for $0 < \lambda < 1$, given the interval $[l, u]$ for the property of a class,
some interval $[r, s] \supseteq [l, u]$ is inherited down to all subclasses, where the tightness of
$[r, s]$ depends on the strength $\lambda$ (roughly, the higher $\lambda$ is, the tighter $[r, s]$ is).

3.2 $\lambda$-Consistency

I now introduce the notion of $\lambda$-consistency for probabilistic knowledge bases.

A probabilistic interpretation $Pr$ $\lambda$-verifies a conditional constraint $(\psi|\phi)[l, u]$ iff
$Pr$ verifies $(\psi|\phi)[l, u]$ and $Pr(\phi) \geq \lambda$. A set of conditional constraints $P$ $\lambda$-tolerates a
conditional constraint $C$ under a set of logical constraints $L$ iff $L \cup P$ has a model that
$\lambda$-verifies $C$. I say $P$ is under $L$ in $\lambda$-conflict with $C$ iff no model of $L \cup P$ $\lambda$-verifies $C$.
A conditional constraint ranking $\sigma$ on a probabilistic knowledge base $KB = (L, P)$
is $\lambda$-admissible with $KB$ iff every $P' \subseteq P$ that is under $L$ in $\lambda$-conflict with some $C \in P$
contains some $C'$ such that $\sigma(C') < \sigma(C)$.

I say $KB$ is $\lambda$-consistent iff there exists a conditional constraint ranking $\sigma$ on $KB$ that
is $\lambda$-admissible with $KB$. Note that the notion of 0-consistency coincides with the notion
of g-coherence. The following theorem characterizes the $\lambda$-consistency of $KB = (L, P)$
through the existence of an ordered partition of $P$.

Theorem 3.1. A probabilistic knowledge base $KB = (L, P)$ is $\lambda$-consistent iff there
exists an ordered partition $(P_0, \ldots, P_k)$ of $P$ such that every $P_i$, $0 \leq i \leq k$, is the set of
all $C \in \bigcup_{j=i}^{k} P_j$ $\lambda$-tolerated under $L$ by $\bigcup_{j=i}^{k} P_j$.

I call this ordered partition $(P_0, \ldots, P_k)$ of $P$ the $z_\lambda$-partition of $KB = (L, P)$. The
following two examples show some $z_\lambda$-partitions.

Example 3.1. Consider the probabilistic knowledge base $KB = (L, P)$ given in
Example 2.1. For every $\lambda \in [0, 1]$, the $z_\lambda$-partition of $KB$ is given by $(P_0) = (P)$. □

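Example 3.2. Let $KB = (L, P)$ be as in Example 2.2. For all $\lambda \in [0, \frac{1}{19}]$, the $z_\lambda$-partition
of $KB = (L, P)$ is $(P_0) = (P)$, as $KB \|\!=_{\mathit{tight}} (\mathit{ostrich}|\top)[0, \frac{1}{19}]$. For all $\lambda \in (\frac{1}{19}, 1]$, it is
$(P_0, P_1) = (\{(\mathit{wings}|\mathit{bird})[0.65, 0.75], (\mathit{fly}|\mathit{bird})[0.95, 1]\}, \{(\mathit{fly}|\mathit{ostrich})[0, 0.05]\})$. □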
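
Theorem 3.1 suggests a simple layering procedure: repeatedly collect the constraints that are $\lambda$-tolerated by everything that remains, and stop with failure if a non-empty remainder tolerates none of its members. The sketch below illustrates this under assumptions. The function names are hypothetical, and the toleration test is passed in as an oracle; a real oracle would have to decide the existence of a suitable model of $L \cup P$, which is not attempted here. The oracle for Example 3.2 is hand-coded from a short calculation: in any model of $L \cup P$, non-flying ostriches carry at least $0.95 \cdot Pr(\mathit{ostrich})$ and at most $0.05 \cdot Pr(\mathit{bird}) \leq 0.05$, so $Pr(\mathit{ostrich}) \leq 0.05 / 0.95 = 1/19$.

```python
from fractions import Fraction


def z_partition(constraints, tolerates):
    """Layer the constraints as in Theorem 3.1.

    `tolerates(remaining, c)` is an assumed oracle: it should answer whether the
    set `remaining` lambda-tolerates the constraint `c` under L.
    Returns the ordered partition [P_0, ..., P_k], or None if none exists.
    """
    remaining = list(constraints)
    partition = []
    while remaining:
        current = frozenset(remaining)
        block = [c for c in remaining if tolerates(current, c)]
        if not block:
            return None  # nothing is tolerated: KB is not lambda-consistent
        partition.append(block)
        remaining = [c for c in remaining if c not in block]
    return partition


def example_3_2_oracle(lam):
    """Hand-coded toleration facts for Example 3.2 (an assumption, not computed)."""
    def tolerates(remaining, c):
        if c == "fly|ostrich" and "fly|bird" in remaining:
            # Together with (fly|bird)[0.95,1], Pr(ostrich) can be at most 1/19.
            return lam <= Fraction(1, 19)
        return True
    return tolerates


P = ["wings|bird", "fly|bird", "fly|ostrich"]
print(z_partition(P, example_3_2_oracle(Fraction(1, 20))))
print(z_partition(P, example_3_2_oracle(Fraction(1, 2))))
```

For a strength below the threshold the sketch returns a single block, and above it the constraint on ostriches is pushed into a second block, matching the two cases of Example 3.2.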

3.3 Probabilistic Lexicographic Entailment of Strength $\lambda$

I now define a probabilistic generalization of Lehmann’s lexicographic entailment [24]
of strength $\lambda \in [0, 1]$ for $\lambda$-consistent probabilistic knowledge bases $KB = (L, P)$.

I use the $z_\lambda$-partition $(P_0, \ldots, P_k)$ of $KB$ to define a lexicographic preference
relation on probabilistic interpretations as follows. For probabilistic interpretations $Pr$
and $Pr'$, I say $Pr$ is $\mathit{lex}_\lambda$-preferable to $Pr'$ iff some $i \in \{0, \ldots, k\}$ exists such that
$|\{C \in P_i \mid Pr \models C\}| > |\{C \in P_i \mid Pr' \models C\}|$ and $|\{C \in P_j \mid Pr \models C\}| = |\{C \in P_j \mid Pr' \models C\}|$
for all $i < j \leq k$. A model $Pr$ of a set of logical constraints and probabilistic formulas
$\mathcal{F}$ is a $\mathit{lex}_\lambda$-minimal model of $\mathcal{F}$ iff no model of $\mathcal{F}$ is $\mathit{lex}_\lambda$-preferable to $Pr$. I use the
expression $\phi \succeq \lambda$ to abbreviate the probabilistic formula $\neg(\phi|\top)[0, 0] \wedge (\phi|\top)[\lambda, 1]$.
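
The preference relation compares, block by block from $P_k$ down to $P_0$, how many constraints each interpretation satisfies. A minimal sketch of this comparison follows; the function name and the representation of an interpretation by its list of per-block counts are assumptions made for illustration.

```python
def lex_preferable(counts_a, counts_b):
    """Return True iff interpretation A is lex-preferable to interpretation B.

    counts_x[i] is the number of constraints in block P_i that x satisfies.
    The highest-indexed block on which the counts differ decides the comparison.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both interpretations must be scored on the same partition")
    for a, b in zip(reversed(counts_a), reversed(counts_b)):
        if a != b:
            return a > b
    return False  # identical counts: neither is preferable to the other


# With blocks (P_0, P_1): satisfying more of P_1 wins, whatever happens in P_0.
print(lex_preferable([0, 1], [2, 0]))
print(lex_preferable([2, 0], [0, 1]))
```

Scanning from the last block first is what makes constraints in higher blocks, which belong to more specific classes, override those in lower blocks.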

I now define the notion of $\mathit{lex}_\lambda$-entailment as follows. A conditional constraint
$(\psi|\phi)[l, u]$ is a $\mathit{lex}_\lambda$-consequence of $KB$, denoted $KB \|\!\sim^{\mathit{lex}_\lambda} (\psi|\phi)[l, u]$, iff every
$\mathit{lex}_\lambda$-minimal model of $L \cup \{\phi \succeq \lambda\}$ satisfies $(\psi|\phi)[l, u]$. It is a tight $\mathit{lex}_\lambda$-consequence
of $KB$, denoted $KB \|\!\sim^{\mathit{lex}_\lambda}_{\mathit{tight}} (\psi|\phi)[l, u]$, iff $l$ (resp., $u$) is the infimum (resp., supremum)
of $Pr(\psi|\phi)$ subject to all $\mathit{lex}_\lambda$-minimal models $Pr$ of $L \cup \{\phi \succeq \lambda\}$.

The following example shows some tight conclusions under $\mathit{lex}_\lambda$-entailment. Similar
to its classical counterpart, $\mathit{lex}_\lambda$-entailment realizes some subclass inheritance, without
showing the problem of inheritance blocking, that is, properties are also inherited to
subclasses that are exceptional relative to other properties. Observe also that logical
properties are completely inherited along subclass relationships, while the inheritance
of purely probabilistic properties depends on the strength $\lambda$.

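Example 3.3. Some tight intervals under $\mathit{lex}_\lambda$-entailment from $KB = (L, P)$ of
Example 2.1 (resp., 2.2) are shown in Table 2 (resp., 3). For example, $[l, u]$ with
$KB \|\!\sim^{\mathit{lex}_\lambda}_{\mathit{tight}} (\mathit{fly}|\mathit{eagle})[l, u]$ is given by $L \cup P \cup \{(\mathit{eagle}|\top)[\lambda, 1]\} \|\!=_{\mathit{tight}} (\mathit{fly}|\mathit{eagle})[l, u]$. □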
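
For the simple situation of one class, one subclass, and no constraint that overrides the inherited property, the characterization in Example 3.3 admits a closed form. The formula below is a derivation made for this sketch and is not stated in the text: if $Pr(\psi|\phi) \in [l, u]$, the subclass $\phi^\star$ implies $\phi$, and $Pr(\phi^\star) \geq \lambda > 0$, then $Pr(\neg\psi \wedge \phi^\star) \leq 1 - l$ and $Pr(\psi \wedge \phi^\star) \leq u$, which gives $r = \max(0, 1 - (1 - l)/\lambda)$ and $s = \min(1, u/\lambda)$. The code compares these bounds with the rows for fly|eagle in Table 2 and wings|ostrich in Table 3, up to the two-decimal rounding used there.

```python
def inherited_interval(lower, upper, lam):
    """Interval [r, s] for Pr(psi|sub), given Pr(psi|cls) in [lower, upper],
    sub implies cls, and Pr(sub) >= lam.

    Closed form derived for this sketch; it assumes no other constraint
    overrides the property on the subclass.
    """
    if lam == 0:
        return (0.0, 1.0)  # strength 0: nothing purely probabilistic is inherited
    r = max(0.0, 1.0 - (1.0 - lower) / lam)
    s = min(1.0, upper / lam)
    return (r, s)


STRENGTHS = [0, 0.2, 0.4, 0.6, 0.8, 1]
# Rows copied from Table 2 (fly|eagle) and Table 3 (wings|ostrich).
TABLE_2_FLY_EAGLE = [(0, 1), (0.75, 1), (0.88, 1), (0.92, 1), (0.94, 1), (0.95, 1)]
TABLE_3_WINGS_OSTRICH = [(0, 1), (0, 1), (0.13, 1), (0.42, 1), (0.56, 0.94), (0.65, 0.75)]


def matches(table, lower, upper, tol=0.006):
    """Compare the closed form with a table row, allowing for two-decimal rounding."""
    return all(
        abs(r - table_r) <= tol and abs(s - table_s) <= tol
        for lam, (table_r, table_s) in zip(STRENGTHS, table)
        for r, s in [inherited_interval(lower, upper, lam)]
    )


print(matches(TABLE_2_FLY_EAGLE, 0.95, 1.0))
print(matches(TABLE_3_WINGS_OSTRICH, 0.65, 0.75))
```

The closed form does not apply to fly|ostrich in Table 3, where the more specific constraint $(\mathit{fly}|\mathit{ostrich})[0, 0.05]$ overrides the inherited one for every strength.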

4 Semantic Properties 

In this section, I explore the semantic properties of $\mathit{lex}_\lambda$-entailment. I first study some
general nonmonotonic properties. I then explore the relationship to logical entailment
and to Lehmann’s lexicographic entailment.

Table 2. Tight intervals under $\mathit{lex}_\lambda$-entailment from KB in Example 2.1

Conditional Event   $\lambda = 0$   $\lambda = 0.2$   $\lambda = 0.4$   $\lambda = 0.6$   $\lambda = 0.8$   $\lambda = 1$
legs|bird           [1,1]      [1,1]      [1,1]      [1,1]      [1,1]      [1,1]
legs|eagle          [1,1]      [1,1]      [1,1]      [1,1]      [1,1]      [1,1]
fly|bird            [0.95,1]   [0.95,1]   [0.95,1]   [0.95,1]   [0.95,1]   [0.95,1]
fly|eagle           [0,1]      [0.75,1]   [0.88,1]   [0.92,1]   [0.94,1]   [0.95,1]


Table 3. Tight intervals under $\mathit{lex}_\lambda$-entailment from KB in Example 2.2

Conditional Event   $\lambda = 0$     $\lambda = 0.2$   $\lambda = 0.4$   $\lambda = 0.6$   $\lambda = 0.8$   $\lambda = 1$
wings|bird          [0.65,0.75]  [0.65,0.75]  [0.65,0.75]  [0.65,0.75]  [0.65,0.75]  [0.65,0.75]
wings|ostrich       [0,1]        [0,1]        [0.13,1]     [0.42,1]     [0.56,0.94]  [0.65,0.75]
fly|bird            [0.95,1]     [0.95,1]     [0.95,1]     [0.95,1]     [0.95,1]     [0.95,1]
fly|ostrich         [0,0.05]     [0,0.05]     [0,0.05]     [0,0.05]     [0,0.05]     [0,0.05]



I first consider the postulates Right Weakening (RW), Reflexivity (Ref), Left Logical
Equivalence (LLE), Cut, Cautious Monotonicity (CM), and Or by Kraus et al. [19],
which are commonly regarded as being particularly desirable for any reasonable notion
of nonmonotonic entailment. The following result shows that $\mathit{lex}_\lambda$-entailment satisfies
(probabilistic versions of) these postulates. Here, $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon \vee \varepsilon')[l, u]$ denotes that
$Pr \models (\phi|\varepsilon)[l, u] \vee (\phi|\varepsilon')[l, u]$ for all $\mathit{lex}_\lambda$-minimal models $Pr$ of $L \cup \{\varepsilon \succeq \lambda \vee \varepsilon' \succeq \lambda\}$.

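Theorem 4.1. Let $KB = (L, P)$ be a $\lambda$-consistent probabilistic knowledge base, let
$\varepsilon, \varepsilon', \phi, \psi$ be events, and let $l, l', u, u' \in [0, 1]$. Then,

RW. If $(\phi|\top)[l, u] \Rightarrow (\psi|\top)[l', u']$ is logically valid and $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon)[l, u]$,
then $KB \|\!\sim^{\mathit{lex}_\lambda} (\psi|\varepsilon)[l', u']$.

Ref. $KB \|\!\sim^{\mathit{lex}_\lambda} (\varepsilon|\varepsilon)[1, 1]$.

LLE. If $\varepsilon \Leftrightarrow \varepsilon'$ is logically valid, then $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon)[l, u]$ iff $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon')[l, u]$.

Cut. If $KB \|\!\sim^{\mathit{lex}_\lambda} (\varepsilon|\varepsilon')[1, 1]$ and $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon \wedge \varepsilon')[l, u]$, then $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon')[l, u]$.

CM. If $KB \|\!\sim^{\mathit{lex}_\lambda} (\varepsilon|\varepsilon')[1, 1]$ and $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon')[l, u]$, then $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon \wedge \varepsilon')[l, u]$.

Or. If $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon)[l, u]$ and $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon')[l, u]$, then $KB \|\!\sim^{\mathit{lex}_\lambda} (\phi|\varepsilon \vee \varepsilon')[l, u]$.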

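Another desirable property is Rational Monotonicity (RM) [19], which describes a
restricted monotony and allows one to ignore some irrelevant knowledge. The next theorem
shows that $\mathit{lex}_\lambda$-entailment satisfies (a weak form of) RM. Here, $KB \not\|\!\sim^{\mathit{lex}_\lambda} \neg(\varepsilon'|\varepsilon)[1, 1]$
denotes that $Pr \models (\varepsilon'|\varepsilon)[1, 1]$ for some $\mathit{lex}_\lambda$-minimal model $Pr$ of $L \cup \{\varepsilon \succeq \lambda\}$.

Theorem 4.2. Let $KB = (L, P)$ be a $\lambda$-consistent probabilistic knowledge base, and let
$\varepsilon, \varepsilon', \psi$ be events. Then,

RM. If $KB \|\!\sim^{\mathit{lex}_\lambda} (\psi|\varepsilon)[1, 1]$ and $KB \not\|\!\sim^{\mathit{lex}_\lambda} \neg(\varepsilon'|\varepsilon)[1, 1]$, then $KB \|\!\sim^{\mathit{lex}_\lambda} (\psi|\varepsilon \wedge \varepsilon')[1, 1]$.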

I next explore the relationship to logical entailment with conditional constraints.
The following theorem shows that $\mathit{lex}_\lambda$-entailment of $(\psi|\phi)[l, u]$ from $KB = (L, P)$ is
weaker than logical entailment of $(\psi|\phi)[l, u]$ from $L \cup P \cup \{\phi \succeq \lambda\}$.

Theorem 4.3. Let $KB = (L, P)$ be a $\lambda$-consistent probabilistic knowledge base, and let
$(\psi|\phi)[l, u]$ be a conditional constraint. Then, $KB \|\!\sim^{\mathit{lex}_\lambda} (\psi|\phi)[l, u]$ implies
$L \cup P \cup \{\phi \succeq \lambda\} \|\!= (\psi|\phi)[l, u]$.

In general, the converse does not hold. But, in the special case when $L \cup P \cup \{\phi \succeq \lambda\}$
is satisfiable, $\mathit{lex}_\lambda$-entailment of $(\psi|\phi)[l, u]$ from $KB = (L, P)$ coincides with logical
entailment of $(\psi|\phi)[l, u]$ from $L \cup P \cup \{\phi \succeq \lambda\}$, as the following theorem shows.

Theorem 4.4. Let $KB = (L, P)$ be a $\lambda$-consistent probabilistic knowledge base, and let
$(\psi|\phi)[l, u]$ be a conditional constraint such that $L \cup P \cup \{\phi \succeq \lambda\}$ is satisfiable. Then,

$KB \|\!\sim^{\mathit{lex}_\lambda} (\psi|\phi)[l, u]$ iff $L \cup P \cup \{\phi \succeq \lambda\} \|\!= (\psi|\phi)[l, u]$.

I finally study the relationship to Lehmann’s lexicographic entailment. The following
result shows that the new notion of $\mathit{lex}_\lambda$-entailment for $\lambda$-consistent probabilistic
knowledge bases generalizes Lehmann’s lexicographic entailment for $\varepsilon$-consistent conditional
knowledge bases, denoted $\|\!\sim^{\mathit{lex}}$ below.

Theorem 4.5. Let $KB = (L, P)$ be a $\lambda$-consistent probabilistic knowledge base, where
$P = \{(\psi_i|\phi_i)[1, 1] \mid i \in \{1, \ldots, n\}\}$, and let $(\beta|\alpha)[1, 1]$ be a conditional constraint.

Then, $KB \|\!\sim^{\mathit{lex}_\lambda} (\beta|\alpha)[1, 1]$ iff $(L, \{\psi_i \leftarrow \phi_i \mid i \in \{1, \ldots, n\}\}) \|\!\sim^{\mathit{lex}} \beta \leftarrow \alpha$.

5 Special Cases 

The notion of $\mathit{lex}_\lambda$-entailment of strength $\lambda = 0$ (resp., $\lambda = 1$) coincides with the notion
of probabilistic lexicographic entailment introduced in [29] (resp., [30]). I now briefly
review these formalisms along with some of their applications.

5.1 Probabilistic Lexicographic Entailment of Strength 0 

The notion of $\mathit{lex}_0$-entailment adds to logical (resp., g-coherent) entailment a strategy for
resolving inconsistencies due to the inheritance of logical knowledge (resp., a restricted
form of inheritance of logical knowledge). This is why $\mathit{lex}_0$-entailment is weaker than
logical entailment and stronger than g-coherent entailment. Hence, $\mathit{lex}_0$-entailment is a
refinement of both logical and g-coherent entailment. It can be used in place of logical
entailment, when we want to resolve inconsistencies related to conditioning on zero
events. Here, it is especially well-suited as it coincides with logical entailment as long as
we condition on non-zero events [29]. Moreover, $\mathit{lex}_0$-entailment can be used in place
of g-coherent entailment, when we also want to have a restricted form of inheritance
of logical knowledge. The following example illustrates the use of $\mathit{lex}_0$-entailment to
resolve inconsistencies related to conditioning on zero events.

Example 5.1. Consider the probabilistic knowledge base $KB = (L, P)$ given by
$L = \{\mathit{bird} \Leftarrow \mathit{penguin}\}$ and $P = \{(\mathit{legs}|\mathit{bird})[1, 1], (\mathit{fly}|\mathit{bird})[1, 1], (\mathit{fly}|\mathit{penguin})[0, 0.05]\}$.
It is not difficult to see that $KB$ is satisfiable, g-coherent, and 0-consistent. Moreover,
it holds that $KB \|\!=_{\mathit{tight}} (\mathit{legs}|\mathit{penguin})[1, 0]$ and $KB \|\!=_{\mathit{tight}} (\mathit{fly}|\mathit{penguin})[1, 0]$.

Here, the empty interval is due to the fact that the logical property of being able to
fly is inherited from birds to penguins, and is incompatible there with penguins being
able to fly with a probability of at most 0.05. That is, there exists no model $Pr$ of $L \cup P$
such that $Pr(\mathit{penguin}) > 0$, and thus we are conditioning on the zero event penguin.

Hence, logical entailment does not provide the desired tight conclusions about penguins
from $KB$: Rather than $(\mathit{legs}|\mathit{penguin})[1, 0]$ and $(\mathit{fly}|\mathit{penguin})[1, 0]$, we would like
to conclude $(\mathit{legs}|\mathit{penguin})[1, 1]$ and $(\mathit{fly}|\mathit{penguin})[0, 0.05]$, respectively. These are
exactly the tight conclusions about penguins obtained under $\mathit{lex}_0$-entailment:

$KB \|\!\sim^{\mathit{lex}_0}_{\mathit{tight}} (\mathit{legs}|\mathit{penguin})[1, 1]$,   $KB \|\!\sim^{\mathit{lex}_0}_{\mathit{tight}} (\mathit{fly}|\mathit{penguin})[0, 0.05]$.

Note that the tight intervals under g-coherent entailment from $KB$ are as follows:

$KB \|\!\sim^{g}_{\mathit{tight}} (\mathit{legs}|\mathit{penguin})[0, 1]$,   $KB \|\!\sim^{g}_{\mathit{tight}} (\mathit{fly}|\mathit{penguin})[0, 0.05]$.

Hence, also g-coherent entailment resolves inconsistencies related to conditioning on
zero events. However, g-coherent entailment is strictly weaker than $\mathit{lex}_0$-entailment, and
thus does not always produce the desired tight conclusions. □
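
The claim in Example 5.1 that every model of $L \cup P$ assigns probability zero to penguin can be spelled out step by step. The derivation below is a sketch that uses only the constraints of the example and the truth conditions of Section 2.1; it assumes, for contradiction, that $Pr(\mathit{penguin}) > 0$.

```latex
\begin{align*}
&\text{from } \mathit{bird} \Leftarrow \mathit{penguin}:
  && Pr(\neg\mathit{fly} \wedge \mathit{penguin}) \le Pr(\neg\mathit{fly} \wedge \mathit{bird}), \\
&\text{from } (\mathit{fly}|\mathit{bird})[1,1]:
  && Pr(\neg\mathit{fly} \wedge \mathit{bird}) = 0, \\
&\text{hence:}
  && Pr(\mathit{fly} \wedge \mathit{penguin}) = Pr(\mathit{penguin}), \\
&\text{from } (\mathit{fly}|\mathit{penguin})[0,0.05] \text{ and } Pr(\mathit{penguin}) > 0:
  && Pr(\mathit{fly} \wedge \mathit{penguin}) \le 0.05 \cdot Pr(\mathit{penguin}).
\end{align*}
```

The last two lines give $Pr(\mathit{penguin}) \le 0.05 \cdot Pr(\mathit{penguin})$, which is impossible for a positive value, so $Pr(\mathit{penguin}) = 0$ in every model of $L \cup P$.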

5.2 Probabilistic Lexicographic Entailment of Strength 1 

The notion of $\mathit{lex}_1$-entailment adds to logical entailment (i) some inheritance of purely
probabilistic knowledge, and (ii) a strategy for resolving inconsistencies due to the
inheritance of logical and purely probabilistic knowledge. For this reason, $\mathit{lex}_1$-entailment
is generally much stronger than logical entailment. Thus, it is especially useful where
logical entailment is too weak, for example, in probabilistic logic programming [28,27]
and probabilistic ontology reasoning in the semantic web [18]. Other applications are
deriving degrees of belief from statistical knowledge and degrees of belief, handling
inconsistencies in probabilistic knowledge bases, and probabilistic belief revision.

In particular, in reasoning from statistical knowledge and degrees of belief,
$\mathit{lex}_1$-entailment behaves similarly to reference-class reasoning [35,20,21,34] in a
number of uncontroversial examples. But it also avoids many drawbacks of reference-class
reasoning [30]: It can handle complex scenarios and even purely probabilistic
subjective knowledge as input. Moreover, conclusions are drawn in a global way from
all the available knowledge as a whole. The following example illustrates the use of
$\mathit{lex}_1$-entailment for reasoning from statistical knowledge and degrees of belief.

Example 5.2. Suppose that we have the statistical knowledge “all penguins are birds”, 
“between 90% and 95% of all birds fly”, “at most 5% of all penguins fly”, and “at least 
95% of all yellow objects are easy to see”. Moreover, assume that we believe “Sam is a 
yellow penguin”. What do we then conclude about Sam’s property of being easy to see? 
Under reference-class reasoning, which is a machinery for dealing with such statistical
knowledge and degrees of belief, we conclude “Sam is easy to see with a probability of
at least 0.95”. This is also what we obtain using the notion of $\mathit{lex}_1$-entailment:

More precisely, the above statistical knowledge can be represented by the probabilistic
knowledge base $KB = (L, P) = (\{\mathit{bird} \Leftarrow \mathit{penguin}\}, \{(\mathit{fly}|\mathit{bird})[0.9, 0.95],
(\mathit{fly}|\mathit{penguin})[0, 0.05], (\mathit{easy\_to\_see}|\mathit{yellow})[0.95, 1]\})$. It is then not difficult to verify
that $KB$ is 1-consistent, and that under $\mathit{lex}_1$-entailment from $KB$, we obtain the tight
conclusion $(\mathit{easy\_to\_see}|\mathit{yellow} \wedge \mathit{penguin})[0.95, 1]$, as desired.

Note that $KB$ is also satisfiable and g-coherent. However, under both logical and
g-coherent entailment from $KB$, we obtain the tight conclusion
$(\mathit{easy\_to\_see}|\mathit{yellow} \wedge \mathit{penguin})[0, 1]$, rather than the above desired one. □



6 Summary and Outlook 

I have presented the notion of $lex_\lambda$-entailment, which is a probabilistic generalization
of Lehmann’s lexicographic entailment that is parameterized through a value $\lambda \in [0, 1]$,
which describes the strength of the inheritance of purely probabilistic knowledge. In the
special case of $\lambda = 0$ (resp., $\lambda = 1$), the new probabilistic formalism coincides with
probabilistic lexicographic entailment in [29] (resp., [30]). I have shown that $lex_\lambda$-entailment
has properties similar to those of its classical counterpart. In particular, it satisfies the
rationality postulates of System P and the property of Rational Monotonicity. Furthermore,
$lex_\lambda$-entailment has a proper embedding of its classical counterpart.

An interesting topic of future research is to develop algorithms for probabilistic
reasoning under $lex_\lambda$-entailment and to analyze its computational complexity (e.g.,
along the lines of [29,30]). Another exciting topic of future research is to develop and
explore further formalisms for nonmonotonic probabilistic reasoning.

Abstract. Most formal approaches to argumentative reasoning under
uncertainty focus on the analysis of qualitative aspects. An exception is 
the framework of probabilistic argumentation systems. Its philosophy is 
to include both qualitative and quantitative aspects through a simple 
way of combining logic and probability theory. Probabilities are used 
to weigh arguments for and against particular hypotheses. ABEL is a
language for describing probabilistic argumentation systems and for
formulating queries about hypotheses. It answers such queries with arguments
and counter-arguments together with their numerical weights.



1 Introduction 

In the last couple of years, argumentation has gained growing recognition as a 
new and promising research direction in artificial intelligence. As a consequence 
of this increasing interest, different authors have investigated argumentation 
and its applications in various domains. By looking at today’s literature on this 
subject, one realizes that argumentation is understood in fairly different ways. 
The common feature of most approaches is their restriction to particular types of 
logic. As a consequence, they are all limited in the way they combine arguments 
for and against a particular hypothesis. 

The approach we present in this paper is known as probabilistic argumen- 
tation systems (PAS) [9]. The idea of the PAS framework goes back to the 
concept of assumption-based truth maintenance systems (ATMS) [6]. It is also 
closely related to abduction [4,11]. The idea is to understand argumentation as
a deductive tool that helps to judge hypotheses, that is, open questions about
the unknown or future world, in the light of the given uncertain and partial
background knowledge.

The principal PAS problem is to derive arguments in favor of and
counter-arguments against the hypothesis of interest. There are efficient anytime
algorithms in which the search is focussed on the most relevant arguments [7,8]. The
strength of the arguments is then measured by underlying probabilities. This
leads to degree of support and degree of possibility, which correspond to belief
and plausibility, respectively, in the Dempster-Shafer theory of evidence [10,13,14].
Such a quantitative judgement is often required to decide whether a hypothesis
can be accepted or rejected, or whether the available knowledge does not
permit a decision.

A system called ABEL [2,3] is an implementation of probabilistic argumentation
systems (see http://www2-iiuf.unifr.ch/tcs/ABEL). It includes
an appropriate modeling and query language, as well as corresponding inference
mechanisms. ABEL is an interactive system in which queries are answered
immediately. Problems from a broad spectrum of application domains show that the
ABEL system is very general and powerful [1]. It has an open architecture that
permits the later inclusion of further or more advanced deduction techniques.

The aim of this paper is to provide a short introduction to probabilistic
argumentation and ABEL. Our hope is to increase the recognition of PAS as a
legitimate formal model and of ABEL as a powerful tool for reasoning under
uncertainty.

2 Probabilistic Argumentation Systems 

The basic ingredients for probabilistic argumentation systems (PAS) are
propositional logic and probability theory. More formally, we require two disjoint sets
$P = \{p_1, \ldots, p_n\}$ and $A = \{a_1, \ldots, a_m\}$ of propositional symbols. The elements
of $P$ are called propositions and the elements of $A$ assumptions. With $\mathcal{L}_{A \cup P}$
we denote the corresponding propositional language that consists of elements of
$A \cup P$ only. Furthermore, we require a propositional sentence $\xi \in \mathcal{L}_{A \cup P}$ that
expresses the qualitative part of the given knowledge. The formula $\xi$ is called the
knowledge base. Finally, a set $\Pi = \{p(a_i) : a_i \in A\}$ of independent probabilities
is required to express the quantitative knowledge. Note how the connection
between propositional logic and probability theory is established through the
assumptions. A quadruple $(\xi, P, A, \Pi)$ is called a probabilistic argumentation system
(PAS).

Example 1. Let $P = \{X, Y, Z\}$ and $A = \{a_1, a_2, a_3, a_4, a_5\}$ be the sets of
propositions and assumptions, respectively. Furthermore, suppose that

$$\Pi = \{p(a_1) = 0.2,\ p(a_2) = 0.4,\ p(a_3) = 0.8,\ p(a_4) = 0.3,\ p(a_5) = 0.3\}$$

are the probabilities of the assumptions and

$$\xi = (a_1 \rightarrow X) \wedge ((a_2 \vee \neg a_3) \rightarrow Y) \wedge ((X \wedge Y) \rightarrow Z) \wedge (\neg a_4 \rightarrow Z) \wedge ((a_5 \wedge Y) \rightarrow \neg Z)$$

the given knowledge base. This forms a probabilistic argumentation system
$(\xi, P, A, \Pi)$. Note that the knowledge base $\xi$ is a conjunction that can be
represented more easily as a conjunctive set

$$\Sigma = \{a_1 \rightarrow X,\ (a_2 \vee \neg a_3) \rightarrow Y,\ (X \wedge Y) \rightarrow Z,\ \neg a_4 \rightarrow Z,\ (a_5 \wedge Y) \rightarrow \neg Z\}$$

of five individual formulas.

The question now is how to use a PAS for the purpose of analyzing and
answering queries about hypotheses. A hypothesis $h$ is usually expressed by a
simple expression that includes symbols of $A \cup P$. To be most general, we consider
arbitrary propositional formulas $h \in \mathcal{L}_{A \cup P}$.

The approach we promote is to construct arguments and counter-arguments
based on the set of assumptions $A$ and to weigh them with the aid of the given
probabilities $\Pi$. An argument can be regarded as a defeasible proof. In other
words, arguments are combinations of true or false assumptions that permit
to infer the truth of the hypothesis $h$ from the given knowledge base. Every
argument thus provides a sufficient reason that proves the hypothesis in the
light of the available knowledge. And it finally contributes to the possibility of
believing or accepting the hypothesis. In other words, arguments support and
counter-arguments defeat the hypothesis $h$. Note that counter-arguments can be
regarded as arguments in favor of the negated hypothesis $\neg h$ and vice versa.
The sets of all arguments and counter-arguments are denoted by $sp(h, \xi)$ and
$sp(\neg h, \xi)$, respectively. For corresponding formal definitions and descriptions of
appropriate inference techniques we refer to the literature [8,7,9].



Example 2. Consider the same PAS as in Example 1 and let $h = Z$ be the
hypothesis of interest. There are four (minimal) arguments, namely:

  $a_1 \wedge a_2 \wedge \neg a_5$, because $a_1$ implies $X$, $a_2$ implies $Y$, $X$ and $Y$ imply $Z$, and
  $\neg a_5$ disallows the conflict $Z \wedge \neg Z$;

  $a_1 \wedge \neg a_3 \wedge \neg a_5$, because $a_1$ implies $X$, $\neg a_3$ implies $Y$, $X$ and $Y$ imply $Z$,
  and $\neg a_5$ disallows the conflict $Z \wedge \neg Z$;

  $\neg a_4 \wedge \neg a_5$, because $\neg a_4$ implies $Z$ and $\neg a_5$ disallows the conflict $Z \wedge \neg Z$;

  $\neg a_2 \wedge a_3 \wedge \neg a_4$, because $\neg a_4$ implies $Z$ and $\neg a_2 \wedge a_3$ disallows the conflict
  $Z \wedge \neg Z$.

Similarly, there are two counter-arguments, namely:

  $\neg a_1 \wedge a_2 \wedge a_4 \wedge a_5$, because $a_2$ implies $Y$, $a_5 \wedge Y$ implies $\neg Z$, and $\neg a_1 \wedge a_4$
  disallows the conflict $Z \wedge \neg Z$;

  $\neg a_1 \wedge \neg a_3 \wedge a_4 \wedge a_5$, because $\neg a_3$ implies $Y$, $a_5 \wedge Y$ implies $\neg Z$, and
  $\neg a_1 \wedge a_4$ disallows the conflict $Z \wedge \neg Z$.

Note that $a_1 \wedge a_2 \wedge a_5$, $a_1 \wedge \neg a_3 \wedge a_5$, and $\neg a_4 \wedge a_5$ are not compatible with the
knowledge base $\xi$. Such incompatible terms are called conflicts.
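
The minimal arguments and counter-arguments listed above can be cross-checked by exhaustive enumeration. The following is only a brute-force sketch in Python, not the anytime algorithms of [7,8] and not ABEL's implementation; the helper names (`kb`, `status`, `is_argument`, `minimal_arguments`) are hypothetical. It assumes that a term counts as an argument when every complete assignment of the assumptions that extends it is consistent with the knowledge base and forces the hypothesis.

```python
from itertools import product

ASSUMPTIONS = ["a1", "a2", "a3", "a4", "a5"]

def implies(p, q):
    return (not p) or q

def kb(a, X, Y, Z):
    """Knowledge base of Example 1, evaluated for one truth assignment."""
    return (implies(a["a1"], X)
            and implies(a["a2"] or not a["a3"], Y)
            and implies(X and Y, Z)
            and implies(not a["a4"], Z)
            and implies(a["a5"] and Y, not Z))

def status(a):
    """Classify a complete scenario of the assumptions."""
    z_values = {Z for X, Y, Z in product([False, True], repeat=3) if kb(a, X, Y, Z)}
    if not z_values:
        return "conflict"      # knowledge base unsatisfiable under this scenario
    if z_values == {True}:
        return "Z"             # scenario forces Z
    if z_values == {False}:
        return "notZ"          # scenario forces (not Z)
    return "open"

def is_argument(term, target):
    """True if every complete scenario extending the partial term has the target status."""
    free = [n for n in ASSUMPTIONS if n not in term]
    for values in product([False, True], repeat=len(free)):
        scenario = {**term, **dict(zip(free, values))}
        if status(scenario) != target:
            return False
    return True

def minimal_arguments(target):
    """Enumerate all 3^5 terms and keep the arguments that have no shorter sub-term."""
    terms = []
    for choice in product([None, False, True], repeat=len(ASSUMPTIONS)):
        term = {n: v for n, v in zip(ASSUMPTIONS, choice) if v is not None}
        if is_argument(term, target):
            terms.append(term)
    return [t for t in terms
            if not any(s != t and s.items() <= t.items() for s in terms)]

def show(term):
    return " & ".join(n if v else "~" + n for n, v in sorted(term.items()))

for label in ("Z", "notZ"):
    print(label, [show(t) for t in minimal_arguments(label)])
```

The enumeration is exponential in the number of assumptions, so it is only suitable for toy examples of this size.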



A quantitative judgement of the situation is obtained by considering the
probabilities that the arguments and counter-arguments are valid. The credibility of
a hypothesis is measured by the probabilities that it is supported or defeated
by at least one argument or one counter-argument, respectively. Conflicts are
handled through conditioning. The resulting degree of support $dsp(h, \xi)$ and
degree of possibility $dps(h, \xi) = 1 - dsp(\neg h, \xi)$ correspond to belief and plausibility,
respectively, in the Dempster-Shafer theory of evidence [13,14].

Example 3. Consider the arguments, counter-arguments, and conflicts shown in
the previous example. The probabilities that at least one argument,
counter-argument, or conflict holds correspond to the probabilities of the disjunctive
normal forms (DNF)

$$\varphi_Z = (a_1 \wedge a_2 \wedge \neg a_5) \vee (a_1 \wedge \neg a_3 \wedge \neg a_5) \vee (\neg a_4 \wedge \neg a_5) \vee (\neg a_2 \wedge a_3 \wedge \neg a_4),$$
$$\varphi_{\neg Z} = (\neg a_1 \wedge a_2 \wedge a_4 \wedge a_5) \vee (\neg a_1 \wedge \neg a_3 \wedge a_4 \wedge a_5),$$
$$\varphi_\bot = (a_1 \wedge a_2 \wedge a_5) \vee (a_1 \wedge \neg a_3 \wedge a_5) \vee (\neg a_4 \wedge a_5),$$

respectively. Using the probabilities $p(a_i)$ as specified in Example 1, we get
$p(\varphi_Z) = 0.612$, $p(\varphi_{\neg Z}) = 0.037$, and $p(\varphi_\bot) = 0.119$. For information about
how to compute probabilities of DNFs we refer to the corresponding literature,
in particular to Darwiche’s d-DNNF compiler [5]. Finally, we get the following
degree of support and degree of possibility, respectively:

$$dsp(Z, \xi) = \frac{p(\varphi_Z)}{1 - p(\varphi_\bot)} = 0.695, \qquad dps(Z, \xi) = 1 - \frac{p(\varphi_{\neg Z})}{1 - p(\varphi_\bot)} = 0.958.$$



These results tell us that the hypothesis $Z$ is supported to a relatively high
degree. At the same time, there are only a few reasons against $Z$, which leads to a
degree of possibility close to 1.
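
The two degrees can be reproduced without any DNF machinery by summing the probabilities of complete assumption scenarios. The sketch below is an illustration in Python under stated assumptions, not the method used by ABEL or by the d-DNNF compiler of [5]: it works directly from the knowledge base of Example 1, enumerates all $2^5$ scenarios, and classifies each one as a conflict, as forcing $Z$, or as forcing $\neg Z$. All function and variable names are hypothetical.

```python
from itertools import product

# Assumption probabilities from Example 1.
prob = {"a1": 0.2, "a2": 0.4, "a3": 0.8, "a4": 0.3, "a5": 0.3}

def implies(p, q):
    return (not p) or q

def kb(a, X, Y, Z):
    """Knowledge base of Example 1, evaluated for one truth assignment."""
    return (implies(a["a1"], X)
            and implies(a["a2"] or not a["a3"], Y)
            and implies(X and Y, Z)
            and implies(not a["a4"], Z)
            and implies(a["a5"] and Y, not Z))

def scenario_probability(a):
    """Probability of a complete scenario; the assumptions are independent."""
    p = 1.0
    for name, value in a.items():
        p *= prob[name] if value else 1.0 - prob[name]
    return p

p_conflict = p_support = p_refute = 0.0
for values in product([False, True], repeat=len(prob)):
    a = dict(zip(prob, values))
    z_values = {Z for X, Y, Z in product([False, True], repeat=3) if kb(a, X, Y, Z)}
    weight = scenario_probability(a)
    if not z_values:
        p_conflict += weight     # scenario contradicts the knowledge base
    elif z_values == {True}:
        p_support += weight      # scenario is consistent and forces Z
    elif z_values == {False}:
        p_refute += weight       # scenario is consistent and forces (not Z)

# Conflicts are handled through conditioning on the non-conflicting scenarios.
dsp = p_support / (1.0 - p_conflict)
dps = 1.0 - p_refute / (1.0 - p_conflict)
print(f"dsp(Z) = {dsp:.3f}, dps(Z) = {dps:.3f}")
```

Rounded to three decimals, the sketch yields the values 0.695 and 0.958 stated in Example 3.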



3 ABEL 

ABEL stands for “Assumption-Based Evidential Language”. Working with
ABEL typically involves two sequential steps. First, the given information is
modeled using the command tell. This command is used to define the two sets
$A$ and $P$, the probabilities $\Pi$, and the knowledge base $\xi$. Second, queries about
the knowledge base are expressed using the command ask.

The ABEL language is based on three other computer languages: (1) from
Common Lisp [16] it adopts prefix notation; (2) from Pulcinella [12] it takes the
idea of the commands tell, ask, and empty; and (3) from a former ABEL
prototype it inherits the concept of modules and the syntax of the queries. See
earlier publications on ABEL for a detailed language specification [3,1]. The
ABEL interface is interactive and behaves like a Common Lisp environment.
The current version is based on the platform-independent XEmacs environment
[15].

An ABEL model usually starts with the declaration of the sets $P$, $A$, and $\Pi$.
The distinction between the elements of $P$ and $A$ is made by using two distinct
commands var and ass. This is shown below for the example introduced
in the previous section. Assumptions with different probabilities must be defined
on different lines. The keyword binary means that only two values are allowed
(true and false). Note that ABEL also supports discrete variables with more
than two values [1,2,3] as well as integers and reals (with some restrictions) [3].

(tell
  (var X Y Z binary)
  (ass a1 binary 0.2)
  (ass a2 binary 0.4)
  (ass a3 binary 0.8)
  (ass a4 a5 binary 0.3))

The knowledge base $\xi$ is then described using a LISP-like prefix language. If
$\xi$ is given as a set $\Sigma$ of statements $\xi_i$, then every individual statement is written
on a separate line. The example of the previous section is written as follows.

(tell
  (-> a1 X)
  (-> (or a2 (not a3)) Y)
  (-> (and X Y) Z)
  (-> (not a4) Z)
  (-> (and a5 Y) (not Z)))

Statements can also be distributed among different tell-commands. Further- 
more, it is also possible to mix variable declarations and statements about the 
knowledge base. The only rule is that every variable must be declared before it 
is first used. 

ABEL supports different types of queries. In the context of argumentative
reasoning, the most important commands are sp (support), dsp (degree of
support), and dps (degree of possibility). Let $Z$ be the hypothesis of interest as in
Example 2. Observe how sp can be used to compute arguments and
counter-arguments for $Z$ (the percentages shown to the left of the arguments indicate the
relative weights of their probabilities).

? (ask (sp Z))
53.3% : (NOT A4) (NOT A5)
24.0% : A1 A2 (NOT A5)
18.7% : A1 (NOT A3) (NOT A5)
 4.0% : A3 (NOT A2) (NOT A4)

? (ask (sp (not Z)))
56.3% : A2 A4 A5 (NOT A1)
43.7% : A4 A5 (NOT A1) (NOT A3)

This corresponds to the results shown in Example 2. Note that $\neg a_4$ alone is
not an argument for $Z$, because $\neg a_4$ together with $a_5$ produces a conflict. To
get a quantitative evaluation of the hypothesis, we can compute corresponding
degrees of support and possibility.

? (ask (dsp Z))
0.695

? (ask (dps Z))
0.958

These results correspond to the ones shown in Example 3.




ABEL: An Interactive Tool for Probabilistic Argumentative Reasoning 593 



Acknowledgements. Research supported by (1) Alexander von Humboldt
Foundation, (2) German Federal Ministry of Education and Research, (3) German
Program for the Investment in the Future, (4) Swiss National Science Foundation.

References 

1. B. Anrig, R. Bissig, R. Haenni, J. Kohlas, and N. Lehmann. Probabilistic argumen- 
tation systems: Introduction to assumption-based modeling with ABEL. Technical 
Report 99-1, Institute of Informatics, University of Fribourg, 1999. 

2. B. Anrig, R. Haenni, J. Kohlas, and N. Lehmann. Assumption-based modeling 
using ABEL. In D. Gabbay, R. Kruse, A. Nonnengart, and H. J. Ohlbach, edi- 
tors, Proceedings of the First International Joint Conference on Qualitative and 
Quantitative Practical Reasoning ECSQARU/FAPR’97, LNCS 1146, pages 171- 
182. Springer, 1997. 

3. B. Anrig, R. Haenni, and N. Lehmann. ABEL - a new language for assumption- 
based evidential reasoning under uncertainty. Technical Report 97-01, Institute of 
Informatics, University of Fribourg, 1997. 

4. D. Berzati, R. Haenni, and J. Kohlas. Probabilistic argumentation systems and
abduction. Annals of Mathematics and Artificial Intelligence, 34(1-3):177-195,
2002.

5. A. Darwiche. A compiler for deterministic, decomposable negation normal form. 
In Proceedings of the 18th National Conference on Artificial Intelligence, pages 
627-634. AAAI Press, 2002. 

6. J. de Kleer. An assumption-based TMS. Artificial Intelligence, 28:127-162, 1986. 

7. R. Haenni. Cost-bounded argumentation. International Journal of Approximate 
Reasoning, 26(2):101-127, 2001. 

8. R. Haenni. A query-driven anytime algorithm for argumentative and abductive
reasoning. In D. Bustard, W. Liu, and R. Sterrit, editors, Soft-Ware 2002, 1st
International Conference on Computing in an Imperfect World, LNCS 2311, pages
114-127. Springer-Verlag, 2002.

9. R. Haenni, J. Kohlas, and N. Lehmann. Probabilistic argumentation systems. In
J. Kohlas and S. Moral, editors, Handbook of Defeasible Reasoning and Uncertainty
Management Systems, Volume 5: Algorithms for Uncertainty and Defeasible
Reasoning, pages 221-288. Kluwer, Dordrecht, 2000.

10. R. Haenni and N. Lehmann. Probabilistic argumentation systems: a new per- 
spective on Dempster-Shafer theory. International Journal of Intelligent Systems 
(Special Issue: the Dempster-Shafer Theory of Evidence), 18(1):93-106, 2003. 

11. D. Poole. Probabilistic Horn abduction and Bayesian networks. Artificial Intelli- 
gence, 64:81-129, 1993. 

12. A. Saffiotti and E. Umkehrer. PULCINELLA: A general tool for propagating un- 
certainty in valuation networks. Technical report, IRIDIA, Universite de Bruxelles, 
1991. 

13. G. Shafer. The Mathematical Theory of Evidence. Princeton University Press, 
1976. 

14. Ph. Smets and R. Kennes. The transferable belief model. Artificial Intelligence, 
66:191-234, 1994. 

15. R. Stallman and B. Wing. XEmacs User’s Manual, 1994. 

16. G.L. Steele. Common Lisp - the Language. Digital Press, 1990. 




The Hugin Tool for Learning Bayesian Networks 



Anders L. Madsen, Michael Lang, Uffe B. Kjaerulff, and Frank Jensen 



Hugin Expert A/S 
Niels Jernes Vej 10 
DK-9220 Aalborg Ø
Denmark 

{Anders.L.Madsen, Michael.Lang, Uffe.Kjaerulff, Frank.Jensen}@hugin.com



Abstract. In this paper, we describe the Hugin Tool as an efficient tool 
for knowledge discovery through construction of Bayesian networks by 
fusion of data and domain expert knowledge. The Hugin Tool supports 
structural learning, parameter estimation, and adaptation of parameters 
in Bayesian networks. The performance of the Hugin Tool is illustrated 
using real-world Bayesian networks, commonly used examples from the 
literature, and randomly generated Bayesian networks. 



1 Introduction 

Probabilistic graphical models such as Bayesian networks [9,3] are efficient
models for (automated) reasoning under uncertainty. A Bayesian network can be
used as an efficient tool for knowledge representation and inference.
Unfortunately, the construction of a Bayesian network can be quite a labor-intensive
task. For this reason, automated construction of Bayesian networks
has in recent years received a lot of attention. This attention has focused on the
automated construction of models from a combination of data and domain
expert knowledge. In this paper, we consider the model construction task as a task
of fusing observational data and domain expert knowledge. Through automated
construction, Bayesian networks can be used as efficient tools for knowledge
discovery and data mining [5].

The Hugin Tool [1,6] is a general purpose tool for probabilistic graphical 
models such as Bayesian networks and influence diagrams. In this paper, we de- 
scribe the knowledge discovery functionality of the Hugin Tool related to (auto- 
mated) construction of Bayesian networks through learning. That is, we describe 
the capabilities of the Hugin Tool for learning the structure and parameters of 
a Bayesian network. In [6] a recent survey of the general functionality of the 
Hugin Tool is given. The present paper extends and details the description of 
the learning functionality of the Hugin Tool given in [6].

2 Preliminaries and Notation 

A Bayesian network $\mathcal{N} = (G = (V, E), \mathcal{P})$ consists of an acyclic, directed graph
(DAG) $G$ and a set of probability distributions $\mathcal{P}$. Each node $X \in V$ represents a
unique random variable. (We use the terms “node” and “variable” interchangeably
and consider only discrete variables.) For each variable $X \in V$ there is a
conditional probability distribution $P(X \mid pa(X)) \in \mathcal{P}$.

T.D. Nielsen and N.L. Zhang (Eds.): ECSQARU 2003, LNAI 2711, pp. 594-605, 2003.
© Springer-Verlag Berlin Heidelberg 2003

A Bayesian network $\mathcal{N} = (G, \mathcal{P})$ is an efficient representation of a joint
probability distribution $P(V)$ over $V$ when $G = (V, E)$ is not dense. $P(V)$ factorizes
according to the structure of $G$ as:

$$P(V) = \prod_{X \in V} P(X \mid pa(X)).$$

We denote variables by uppercase letters $X, Y, \ldots$, sets of variables by
uppercase letters $S, S_{XY}, \ldots$, and states of variables by lowercase letters $x, y, \ldots$.

Through the parameters, it is possible to specify unconstrained probabilistic
dependence relations between a node and its parents.
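
To make the factorization concrete, the following Python sketch evaluates the joint probability of a complete configuration as a product of one conditional probability per node. The three-node network (S with children L and B), its numbers, and the names `parents`, `cpt`, and `joint` are illustrative assumptions and are not taken from the Hugin Tool.

```python
from itertools import product

# Hypothetical network: S -> L and S -> B, all variables binary.
parents = {"S": [], "L": ["S"], "B": ["S"]}

# cpt[X][parent states][x] = P(X = x | pa(X) = parent states); illustrative numbers.
cpt = {
    "S": {(): {True: 0.5, False: 0.5}},
    "L": {(True,): {True: 0.1, False: 0.9}, (False,): {True: 0.01, False: 0.99}},
    "B": {(True,): {True: 0.6, False: 0.4}, (False,): {True: 0.3, False: 0.7}},
}

def joint(assignment):
    """P(V = assignment) as the product of P(X | pa(X)) over all nodes X."""
    p = 1.0
    for var, pa in parents.items():
        pa_states = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_states][assignment[var]]
    return p

# The joint probabilities of all configurations sum to one.
total = sum(joint(dict(zip(parents, states)))
            for states in product([True, False], repeat=len(parents)))
print(joint({"S": True, "L": True, "B": False}), total)
```

The table for each node grows with the number of parent configurations only, which is why the representation is efficient when the graph is not dense.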

The graph $G$ of $\mathcal{N}$ induces a set of (conditional) dependence and
independence relations (CIDRs) $\mathcal{M}_G$, which can be read off $G$ using the d-separation
criterion [4]. The relation $X \perp_P Y \mid S$ states conditional independence between $X$
and $Y$ given $S$ under the probability distribution $P$ whereas $X \perp_G Y \mid S$ states
conditional independence between $X$ and $Y$ given $S$ in the DAG $G$ (i.e.
d-separation). When no confusion is possible, subscripts are omitted. The
faithfulness assumption [13] (a.k.a. stability [10]) says that the distribution $P$ over $V$
induced by $(G, \Theta)$ satisfies no independence relations beyond those implied by
the structure of $G$.

A DAG represents a set of CIDRs and two DAGs may represent the same set
of CIDRs. Two DAGs representing the same set of CIDRs are equivalent. A DAG
is an acyclic, directed graph whereas a PDAG is an acyclic, partially directed
graph, i.e. an acyclic graph with some edges undirected (a.k.a. a pattern [10]). A
PDAG can be used to represent the equivalence class of a DAG. The equivalence
class of a DAG $G$ is the set of DAGs with the same set of d-separation relations
as $G$. Two DAGs $G_1$ and $G_2$ are equivalent if they have the same skeleton and
the same set of colliders (i.e. $X \rightarrow Y \leftarrow Z$ structures), see e.g. [10].
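
The equivalence criterion can be checked mechanically. The Python sketch below compares two DAGs, each given as a list of directed edges `(parent, child)`. It assumes that "collider" refers to structures $X \rightarrow Y \leftarrow Z$ in which $X$ and $Z$ are not adjacent, which matches the collider identification step in Section 3.1. The function names are hypothetical.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the graph: a set of unordered node pairs."""
    return {frozenset(edge) for edge in edges}

def colliders(edges):
    """Triples (X, Y, Z) with X -> Y <- Z where X and Z are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for x, y in edges:
        parents.setdefault(y, set()).add(x)
    found = set()
    for y, pa in parents.items():
        for x, z in combinations(sorted(pa), 2):
            if frozenset((x, z)) not in skel:
                found.add((x, y, z))
    return found

def equivalent(edges1, edges2):
    """Same skeleton and same set of colliders."""
    return (skeleton(edges1) == skeleton(edges2)
            and colliders(edges1) == colliders(edges2))

chain = [("A", "B"), ("B", "C")]        # A -> B -> C
reversed_chain = [("B", "A"), ("C", "B")]  # A <- B <- C
collider = [("A", "B"), ("C", "B")]     # A -> B <- C
print(equivalent(chain, reversed_chain), equivalent(chain, collider))
```

The two chains share a skeleton and have no colliders, so they fall into the same equivalence class, whereas the collider graph does not.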

Example 2.1 [Chest Clinic]

Dyspnoea (D) may be due to tuberculosis (T), lung cancer (L), or bronchitis (B),
or none of them, or more than one of them. A recent visit to Asia (A) increases
the chances of tuberculosis, while smoking (S) is known to be a risk factor for
both lung cancer and bronchitis. The result of a single chest X-ray (X) does not
discriminate between lung cancer and tuberculosis, as neither does the presence
or absence of dyspnoea, see e.g. [3].

The qualitative knowledge of this diagnostic problem can be captured by the
DAG shown in Fig. 1(a) with a mediating variable E representing the disjunction
of tuberculosis and lung cancer.

3 Learning a Bayesian Network 

In the remainder of this paper, we will assume that $P_0$ is a DAG faithful
probability distribution with underlying DAG $G_0$. We consider learning a Bayesian
network as the task of identifying a DAG structure $G$ and a set of corresponding
parameters $\Theta$ from a sample of data cases $D = \{c_1, \ldots, c_N\}$ drawn at random
from $P_0$ and possibly some domain expert background knowledge.

Fig. 1. The Chest Clinic example



3.1 Structural Learning 

Structural learning is supported through a constraint-based approach [16,15,13].
In the constraint-based approach, the graph $G$ of a Bayesian network $\mathcal{N}$ is
considered as an encoding of a set of CIDRs $\mathcal{M}$. Structural learning is then the
task of identifying a DAG structure from a set of CIDRs derived from the data
by statistical tests.

Two algorithms for structural learning are supported: the PC algorithm [12,13]
(which is similar to the IC algorithm [15,10]) and its extension, the NPC
algorithm [14]. The main steps of the PC algorithm are:

1. Test for (conditional) independence between each pair of variables. 

2. Identify the skeleton of the graph induced by the derived CIDRs.

3. Identify colliders. 

4. Identify derived directions. 

The PC algorithm produces a PDAG. In step 1, the hypothesis is that $X$ and $Y$
are independent given $S_{XY}$. This hypothesis is tested by statistical tests using
conditioning sets $S_{XY}$ of size 0, 1, 2, 3. If $X \perp Y \mid S_{XY}$ is found to be satisfied
at some significance level $\alpha$, the search for independence between $X$ and $Y$ is
terminated.

Various improvements of the straightforward incremental testing scheme have
been implemented. These improvements are related to maintaining an undirected
graph describing the current set of neighbors of each node and only performing
independence tests conditional on subsets of the neighbors of $X$ and $Y$. The order
in which we try out the possible conditioning sets of a fixed size is according to
how likely they are to cause independence for the edge under consideration. We
use the heuristic rule that the variables of the conditioning set should be strongly
correlated with both endpoints of the edge being tested. The neighbor graph is
updated after each independence test accepting the hypothesis.

Fig. 2. Four rules for orientation of edges

Due to the nature of the testing scheme, the conditioning set $S_{XY}$ for an
identified independence relation $X \perp Y \mid S_{XY}$ is minimal in the sense that
no proper subset of $S_{XY}$ and no set of cardinality less than the cardinality
of $S_{XY}$ produces independence. An undirected edge is added between each pair
of variables $X, Y$ whenever no conditional independence relation has been found
between $X$ and $Y$. This produces the skeleton of the graph.

Once the skeleton has been identified, colliders are identified. If $X$ and $Y$ are
neighbors, $Z$ and $Y$ are neighbors, $X$ and $Z$ are not neighbors, and $Y \notin S_{XZ}$
for any $S_{XZ}$ satisfying $X \perp Z \mid S_{XZ}$, then a collider is created at $Y$.
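
Steps 1 to 3 can be sketched compactly if the statistical test is replaced by an independence oracle. The Python code below is a simplified illustration, not the Hugin implementation: `independent(x, y, s)` is a hypothetical callable supplied by the caller, conditioning sets are drawn from the current neighbors of either endpoint, and the ordering heuristic for conditioning sets is omitted. The oracle used here encodes the hypothetical DAG A -> C <- B.

```python
from itertools import combinations

def pc_skeleton(variables, independent, max_size=3):
    """Steps 1-2: drop the edge X - Y as soon as some S with X _|_ Y | S is found."""
    neighbors = {v: set(variables) - {v} for v in variables}
    sepset = {}
    for size in range(max_size + 1):
        for x, y in combinations(variables, 2):
            if y not in neighbors[x]:
                continue
            # Simplification: candidates are current neighbors of X or Y.
            candidates = sorted((neighbors[x] | neighbors[y]) - {x, y})
            for s in combinations(candidates, size):
                if independent(x, y, set(s)):
                    neighbors[x].discard(y)
                    neighbors[y].discard(x)
                    sepset[frozenset((x, y))] = set(s)
                    break
    return neighbors, sepset

def find_colliders(neighbors, sepset):
    """Step 3: X - Y - Z with X, Z non-adjacent and Y not in the separating set."""
    found = set()
    for y in neighbors:
        for x, z in combinations(sorted(neighbors[y]), 2):
            if z not in neighbors[x] and y not in sepset[frozenset((x, z))]:
                found.add((x, y, z))
    return found

def oracle(x, y, s):
    """Independence oracle for the hypothetical DAG A -> C <- B."""
    return {x, y} == {"A", "B"} and "C" not in s

neighbors, sepset = pc_skeleton(["A", "B", "C"], oracle)
print(neighbors, find_colliders(neighbors, sepset))
```

With real data the oracle would be replaced by a statistical test at significance level $\alpha$, and the result would then depend on the sample.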

Starting with any PDAG $G$, a maximally directed PDAG can be obtained
following four necessary [15] and sufficient [8] rules, see Fig. 2. That is, by
repeated application of these four rules all edges common to the equivalence class
of $G$ are identified. The fourth rule is unnecessary if the orientation of the initial
PDAG is limited to colliders (i.e. no background knowledge). The four rules are
necessary and sufficient for achieving maximal orientation (up to equivalence)
of the PDAG returned by the PC algorithm. The first rule follows from the fact
that no collider was identified; the remaining rules ensure that no directed cycle
is created.

Correctness of the PC algorithm has been proved under the assumption of
infinite data sets. In real life, data sets are finite. When dealing with finite data
sets, the faithfulness assumption is often violated. Hence, when the derived set
of CIDRs is induced by statistical tests on finite data sets, we cannot in general
expect that there exists a DAG (or PDAG) which represents all CIDRs. Often
too many conditional independence relations are derived due to the limited data
set. This suggests representing all conditional dependence relations, but not all
conditional independence relations, in the induced DAGs. Applying the principle
of Occam’s Razor, we will choose the simplest model among equally good models.

As mentioned above, the NPC algorithm is an extension of the PC algorithm.
The new feature of the NPC learning algorithm is the introduction of the notion
of a Necessary Path Condition [14]. Informally, the necessary path condition says
that in order for two variables $X$ and $Y$ to be independent (in a DAG faithful
data set) conditional on a set $S$ and no subset $S' \subset S$, there must exist a path
between $X$ and every $Z \in S$ (not crossing $Y$) and between $Y$ and every $Z \in S$
(not crossing $X$). Otherwise, the inclusion of each $Z$ in $S$ is unexplained. Thus,
in order for an independence relation to be valid, a number of edges are required
to be present in the graph.

An edge $(X, Y)$ is an uncertain edge if the absence of $(X, Y)$ depends on the
presence of an edge $(X', Y')$, and vice versa. A maximal set of interdependent
uncertain edges is an ambiguous region. An uncertain edge indicates inconsistency
in the set of independence relations derived by the statistical tests.

In order to increase reliability and stability of the NPC algorithm, the
iteration step for a fixed size of the conditioning set is completed even if an
independence statement is found. Thus, multiple independence relations may be
found for a pair of variables. If one of these independence relations satisfies the
necessary path condition, then it is accepted.

Prior to the testing phase, background knowledge in the form of constraints 
on the structure of the DAG can be specified. It is possible to specify the pres- 
ence and absence of edges, the orientation of edges, and a combination. At the 
moment, user specified constraints are not tested. In practice, this has produced 
some unwanted behavior of the edge orientation algorithm. 

Example 3.1 [Structural learning in Chest Clinic]

Figure 1(b) shows the PDAG generated by the NPC algorithm applied to a
random sample of 10,000 cases drawn from the Chest Clinic network with a
significance level of $\alpha = 0.05$.

The sets of edges $\{(T, X), (E, X), (L, X)\}$ and $\{(T, D), (E, D), (L, D)\}$ are
the two ambiguous regions of Fig. 1(b). The two ambiguous regions are due to
the deterministic relation between $E$ and $L, T$ (i.e. $E = L \vee T$). This produces,
for instance, $\{X \perp E \mid T, L;\ X \perp T \mid E;\ X \perp L \mid E\}$, which is impossible
according to the necessary path condition. The simplest resolution is to include
the edge $(E, X)$. Notice that some certain edges are directed and some are
undirected.

3.2 Parameter Estimation 

The task of parameter estimation is to estimate the values of the parameters $\Theta$
corresponding to a given DAG structure $G$. Parameter estimation is supported
through the EM algorithm [7]. The EM algorithm is well-suited for calculating
maximum likelihood and maximum a posteriori estimates in the case of missing
data.

Let $\mathcal{N} = (G, \mathcal{P})$ be a Bayesian network with parameters $\Theta$ such that
$\theta_{ijk} = P(X_i = k \mid pa(X_i) = j)$ for each $i, j, k$. Following [7], the EM algorithm is based
on computing the expected value of the log-likelihood function:

$$Q(\Theta^* \mid \Theta) = E_\Theta\{\log P(X \mid \Theta^*) \mid D\},$$

where $P$ is the density function for $X$, and $D$ is the observed data $D = g(X)$.
Given an initial value of the parameters $\Theta$, the E-step is to compute the
current expected value of $Q$ with respect to $\Theta$ while the subsequent M-step is to
maximize $Q$ in $\Theta^*$. These two steps are alternated iteratively until a stopping
criterion is satisfied. In the case of missing data, the log-likelihood function is a
linear function in the sufficient marginals [7]. The log-likelihood function $l(\Theta \mid D)$
of the parameters $\Theta$ given the data $D$ and DAG $G$ is:

$$l(\Theta \mid D) = \sum_{i=1}^{N} \log P(c_i \mid \Theta).$$

In the case of Bayesian networks, the E-step of the EM algorithm is to
compute expected counts for each family $fa(X_i)$ and parent $pa(X_i)$ configuration of
each node $X_i$ under $\Theta$:

$$n^*(Y) = E_\Theta\{n(Y) \mid D\},$$

where $Y$ is either $pa(X_i) = j$ or $X_i = k, pa(X_i) = j$. The M-step computes new
estimates $\theta^*_{ijk}$ from the expected counts under $\theta_{ijk}$:

$$\theta^*_{ijk} = \frac{n^*(X_i = k, pa(X_i) = j)}{n^*(pa(X_i) = j)}.$$
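
As a worked illustration of the two steps, the following Python sketch runs EM on a hypothetical two-node network A -> B with binary variables, where the value of A may be missing in some cases. It is a minimal sketch under these assumptions and does not reflect how the Hugin Tool computes expected counts (which would use inference in the full network); the function name `em` and the data are made up for the example.

```python
def em(cases, p_a, p_b_given_a, iterations=50):
    """cases: list of (a, b) with a in {0, 1, None} (None = missing) and b in {0, 1}."""
    for _ in range(iterations):
        n_a = [0.0, 0.0]                  # expected counts n*(A = a)
        n_ab = [[0.0, 0.0], [0.0, 0.0]]   # expected counts n*(A = a, B = b)
        for a, b in cases:
            if a is None:
                # E-step: posterior P(A = x | B = b) under the current parameters.
                joint = [p_a[x] * p_b_given_a[x][b] for x in (0, 1)]
                z = joint[0] + joint[1]
                weights = [joint[0] / z, joint[1] / z]
            else:
                weights = [1.0 if x == a else 0.0 for x in (0, 1)]
            for x in (0, 1):
                n_a[x] += weights[x]
                n_ab[x][b] += weights[x]
        # M-step: ratio of expected family counts to expected parent counts.
        total = n_a[0] + n_a[1]
        p_a = [n_a[x] / total for x in (0, 1)]
        p_b_given_a = [[n_ab[x][y] / n_a[x] for y in (0, 1)] for x in (0, 1)]
    return p_a, p_b_given_a

uniform = [[0.5, 0.5], [0.5, 0.5]]
complete = [(0, 0), (0, 0), (0, 1), (1, 1)]
p_a_complete, p_b_complete = em(complete, [0.5, 0.5], uniform)

partial = [(0, 0), (1, 1), (None, 1), (None, 0)]
p_a_partial, p_b_partial = em(partial, [0.5, 0.5], [[0.7, 0.3], [0.2, 0.8]])
print(p_a_complete, p_a_partial)
```

With complete data the expected counts are ordinary counts, so the estimates reduce to relative frequencies after a single iteration.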



The E-step and M-step are iterated until convergence of $l(\Theta)$ (or until a
limit on the number of iterations is reached). In the Hugin Tool convergence
is achieved when the difference between the log-likelihoods of two consecutive
iterations is less than or equal to the numerical value of a log-likelihood threshold
times the log-likelihood. Alternatively, the user can specify an upper limit on the
number of iterations to ensure that the procedure terminates.
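
The stopping rule can be written as a small predicate. The Python sketch below is an interpretation of the description above, not code from the Hugin Tool: it assumes that the threshold is multiplied by the absolute value of the current log-likelihood, and `em_step` is a hypothetical callable that performs one E-step and M-step and returns the new parameters together with their log-likelihood. The default values are placeholders.

```python
def has_converged(previous_ll, current_ll, threshold=1e-4):
    """Relative criterion: |difference| <= threshold * |current log-likelihood|."""
    return abs(current_ll - previous_ll) <= threshold * abs(current_ll)

def iterate(em_step, theta, threshold=1e-4, max_iterations=100):
    """Run em_step until the relative criterion holds or the iteration limit is hit."""
    previous_ll = None
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        theta, current_ll = em_step(theta)
        if previous_ll is not None and has_converged(previous_ll, current_ll, threshold):
            break
        previous_ll = current_ll
    return theta, iteration

# Toy stand-in for an EM step: the log-likelihood approaches -100 geometrically.
theta, iterations_used = iterate(lambda t: (t / 2.0, -100.0 - t), 1.0)
print(iterations_used)
```

The iteration limit guarantees termination even when the relative change never drops below the threshold.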

When both data and domain expert knowledge are available, these two sources 
of knowledge can be fused. In [11] the notion of experience is introduced. 
Experience is the quantitative knowledge related to a probability distribution 
based on quantitative expert knowledge. Expert knowledge on the parameters 
is specified as Dirichlet distributions. For each variable $X_i$, the distribution 
$P(X_i \mid \mathrm{pa}(X_i)) = \{p_{ijk}\}$ and the experience counts 
$\alpha_{i1}, \ldots, \alpha_{i|\mathrm{pa}(X_i)|}$ associated with $X_i$ are used to specify the 
prior expert knowledge. Hence, the experience table of a variable $X_i$ indicates 
the experience related to the child distribution for each configuration of the 
parents. In the case of expert knowledge, the E-step does not change whereas 
the M-step becomes: 

$$\theta^*_{ijk} = \frac{n^*(X_i = k, \mathrm{pa}(X_i) = j) + p_{ijk}\,\alpha_{ij}}{n^*(\mathrm{pa}(X_i) = j) + \alpha_{ij}}.$$
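
The M-step with experience counts can be sketched for a single variable as follows. This is a minimal illustration, not the Hugin implementation: the function name `m_step_with_experience` and the list-based table layout are assumptions, and the parent count $n^*(\mathrm{pa}(X_i) = j)$ is obtained by summing the family counts over the states of $X_i$.

```python
def m_step_with_experience(n_family, prior_p, alpha):
    """M-step for one variable X_i with prior experience.

    n_family[j][k]: expected count n*(X_i = k, pa(X_i) = j)
    prior_p[j][k]:  prior probability p_ijk
    alpha[j]:       experience count alpha_ij for parent configuration j
    Assumes n*(pa(X_i) = j) + alpha_ij > 0 for every j.
    """
    theta = []
    for n_j, p_j, a_j in zip(n_family, prior_p, alpha):
        n_parent = sum(n_j)          # n*(pa(X_i) = j)
        denom = n_parent + a_j
        theta.append([(n + p * a_j) / denom for n, p in zip(n_j, p_j)])
    return theta


# One parent configuration, two states, hypothetical counts
print(m_step_with_experience([[30.0, 10.0]], [[0.5, 0.5]], [10.0]))
print(m_step_with_experience([[30.0, 10.0]], [[0.5, 0.5]], [0.0]))
```

With an experience count of zero the update reduces to the plain ratio of expected counts, which is the case of no expert knowledge.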



The quality of the model is expressed in the value of $l(\theta \mid D)$ computed after 
each iteration. It should be noticed that $l(\theta \mid D)$ as a quality measure does not 
incorporate the complexity of the model. For comparison of models with different 
complexity, other measures such as BIC or AIC should be used. 
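
To make the role of model complexity concrete, the sketch below computes AIC and BIC in their score form (higher is better), i.e. the log-likelihood minus a penalty that grows with the number of free parameters. The sign convention, the helper names and the three-variable network with its log-likelihood values are assumptions chosen for illustration and do not come from the evaluation in this paper.

```python
import math


def free_parameters(cardinalities, parents):
    """Free parameters of a discrete Bayesian network.

    Each node contributes (states - 1) parameters per parent configuration.
    """
    total = 0
    for node, card in cardinalities.items():
        configs = 1
        for parent in parents.get(node, ()):
            configs *= cardinalities[parent]
        total += (card - 1) * configs
    return total


def aic(log_likelihood, k):
    """AIC in score form: log-likelihood minus number of free parameters."""
    return log_likelihood - k


def bic(log_likelihood, k, n_cases):
    """BIC in score form: log-likelihood minus (k / 2) * log(N)."""
    return log_likelihood - 0.5 * k * math.log(n_cases)


# Hypothetical comparison of two structures over binary variables S, L, B
cards = {"S": 2, "L": 2, "B": 2}
sparse = {"L": ["S"], "B": ["S"]}        # S -> L, S -> B
dense = {"L": ["S"], "B": ["S", "L"]}    # adds L -> B
for name, structure, ll in [("sparse", sparse, -100.0), ("dense", dense, -99.5)]:
    k = free_parameters(cards, structure)
    print(name, k, aic(ll, k), bic(ll, k, 10000))
```

In this made-up comparison the denser structure has a slightly higher log-likelihood but a lower penalized score, which is the effect the raw log-likelihood cannot capture.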

The experience counts for the prior beliefs in the conditional probability 
distribution of variable $X_i$ given its parents $\mathrm{pa}(X_i)$ are specified in a separate 
table including one experience count for each configuration of $\mathrm{pa}(X_i)$. After the 
termination of the EM algorithm, the expected counts are stored as the experience 
counts. 

Example 3.2 [Parameter estimation in Chest Clinic] 

Assume that the qualitative knowledge of the Chest Clinic example is as shown in 
Fig. 1 and that a database $D = \{c_1, \ldots, c_N\}$ with N = 10,000 is available. From 
the qualitative knowledge, we know $E = L \lor T$. This is specified in $P(E \mid L, T)$, 
and no experience table is allocated to E in order to avoid estimation of this 
table from the data, whereas all other variables have an experience table consisting 
of zeros, indicating no expert knowledge on the distributions. 

The EM algorithm will produce a maximum likelihood estimate of the parameters 
of the model under the constraint that $P(E \mid L, T)$ encodes disjunction. If 
we had expert knowledge on $P(S)$, for instance, we would specify this in $P(S)$ 
and encode the second order uncertainty in the experience table of S. 
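
The deterministic table for E can be written down directly. The sketch below builds $P(E \mid L, T)$ for the disjunction, using the state labels `y` and `n`; the function name and the dictionary layout are assumptions made for this illustration. Since the table contains only zeros and ones, nothing in it needs to be estimated from the data.

```python
from itertools import product


def disjunction_cpt():
    """P(E | L, T) for E = L or T, keyed by the parent states (l, t)."""
    cpt = {}
    for l, t in product("yn", repeat=2):
        e_yes = 1.0 if (l == "y" or t == "y") else 0.0
        cpt[(l, t)] = {"y": e_yes, "n": 1.0 - e_yes}
    return cpt


for parents, row in disjunction_cpt().items():
    print(parents, row)
```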

3.3 Sequential Updating 

Sequential updating or adaptation [11,3] is the task of sequentially updating the 
conditional probability distributions of a Bayesian network when the structure 
and an initial specification of the conditional probability distributions are given 
in advance. In sequential learning, experience is extended to include both quanti- 
tative expert knowledge and past cases (e.g. from EM learning). Thus, the result 
of EM learning could be used as the input for sequential learning. 

Let $X_i$ be a variable with $n$ states, then the prior belief in the parameter 
vector $\theta_{ij} = (\theta_{ij1}, \ldots, \theta_{ijn})$, i.e. the conditional probability distribution of 
a variable $X_i$ given its parents $\mathrm{pa}(X_i) = j$, is specified as an $n$-dimensional 
Dirichlet distribution $\mathcal{D}(\alpha_{ij1}, \ldots, \alpha_{ijn})$. This distribution is represented using 
a single experience count $\alpha_{ij}$ (equivalent sample size) and the initial content 
of $P(X_i \mid \mathrm{pa}(X_i) = j)$. The experience count $\alpha_{ijk}$ for a particular state $k$ of $X_i$ 
given $\mathrm{pa}(X_i) = j$ is $\alpha_{ijk} = \alpha_{ij} p_{ijk}$. 

After a complete observation on $(X_i = k, \mathrm{pa}(X_i) = j)$, the posterior belief in 
the distribution is updated as $\alpha^*_{ijk} = \alpha_{ijk} + 1$ and $\alpha^*_{ijl} = \alpha_{ijl}$ for $l \neq k$. After an 
incomplete observation, the posterior belief in $\theta_{ij}$ is a Dirichlet mixture, which 
is approximated by a single Dirichlet distribution having the same means and 
sum of variances as the mixture. The approximation is used in order to avoid 
the combinatorial explosion, which would otherwise occur when subsequent 
incomplete observations are made. The updated mean and variance are computed 
as: 



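$$m^*_{ijk} = \frac{\alpha_{ijk} + P(X_i = k, \mathrm{pa}(X_i) = j \mid e) + p_{ijk}\,\bigl(1 - P(\mathrm{pa}(X_i) = j \mid e)\bigr)}{\alpha_{ij} + 1}$$

$$v_{ij} = \sum_{k} \frac{m_{ijk}\,(1 - m_{ijk})}{\alpha_{ij} + 1}$$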



The updated experience count is computed from the mean and variance. 
This process is referred to as retrieval of experience. Dissemination of 
experience is the process of calculating prior conditional probability distributions 
for the variables in the Bayesian network given the experience, and it proceeds 
by setting the value of each parameter equal to the mean of the corresponding 
updated Dirichlet distribution, i.e. $\theta_{ijk} = m^*_{ijk}$. 
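
For the complete-observation case, retrieval and dissemination reduce to simple bookkeeping, sketched below. The sketch covers only complete observations; the Dirichlet-mixture approximation used for incomplete observations is not reproduced here. The function names and the list representation of one row of the conditional probability table are assumptions made for this example.

```python
def experience_per_state(alpha_ij, p_ij):
    """Per-state counts alpha_ijk = alpha_ij * p_ijk for one parent configuration."""
    return [alpha_ij * p for p in p_ij]


def adapt_complete(alpha_ij, p_ij, k):
    """Adapt one distribution after observing X_i = k with pa(X_i) = j.

    Retrieval: alpha_ijk is increased by one, the other counts are kept.
    Dissemination: each parameter is set to the mean of the updated Dirichlet.
    Returns (updated experience count, updated probabilities).
    """
    counts = experience_per_state(alpha_ij, p_ij)
    counts[k] += 1.0
    new_alpha = sum(counts)
    new_p = [c / new_alpha for c in counts]
    return new_alpha, new_p


# Hypothetical prior: experience count 10 with probabilities (0.2, 0.8)
print(adapt_complete(10.0, [0.2, 0.8], 0))
```

A small experience count lets a single case move the distribution noticeably, while a large count makes the distribution nearly insensitive to one additional case.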

In order to reduce the influence of the past and possibly outdated information, 
an optional feature of fading is provided. Fading proceeds by reducing the 
experience count before the retrieval of experience takes place. The experience 
count $\alpha_{ij}$ is faded by a factor of $0 < \lambda_{ij} < 1$, typically close to 1, according 
to $p_{ij} = P(\mathrm{pa}(X_i) = j)$ such that $\alpha^*_{ij} = \alpha_{ij}((1 - p_{ij}) + \lambda_{ij} p_{ij})$. Notice that 
experience counts corresponding to parent configurations that are inconsistent 
with the evidence are unchanged. The fading factors of a variable $X_i$ are 
specified in a separate table including one fading factor for each configuration 
of $\mathrm{pa}(X_i)$. 
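
The fading rule is a one-line computation, shown here as a small sketch. The function name `fade` and the numbers in the usage lines are hypothetical; they are not the values of the Chest Clinic example.

```python
def fade(alpha_ij, lambda_ij, p_ij):
    """Faded experience count alpha_ij * ((1 - p_ij) + lambda_ij * p_ij).

    alpha_ij:  experience count before fading
    lambda_ij: fading factor for parent configuration j
    p_ij:      probability of parent configuration j given the evidence
    """
    return alpha_ij * ((1.0 - p_ij) + lambda_ij * p_ij)


print(fade(100.0, 0.9, 1.0))   # configuration certain: full fading
print(fade(100.0, 0.9, 0.5))   # configuration uncertain: partial fading
print(fade(100.0, 0.9, 0.0))   # configuration ruled out: count unchanged
```

The amount of fading is thus proportional to how likely the parent configuration is under the evidence, so a configuration that is inconsistent with the evidence keeps its experience count.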

Example 3.3 [Adaptation in Chest Clinic] 

Assume we have evidence $e = \{S = n, A = y, D = y\}$ on a patient, i.e. a 
non-smoking patient with dyspnoea who has recently been to Asia. The evidence is 
entered and propagated followed by an adaptation of parameters. Table 1 shows 
the experience counts for L, B, and S before (i.e. after EM learning using 10,000 
randomly generated cases) and after the adaptation with fading factor of 0.999 
for each distribution. Notice that since S is an observed variable without parents, 
the experience count $\alpha_S$ for $P(S)$ will converge to j = 1001 if S = n is observed 
multiple times. 



Table 1. Experience counts for B, L, and S before and after adaptation 

          $\alpha_S$   $\alpha_{L|S=no}$   $\alpha_{L|S=yes}$   $\alpha_{B|S=no}$   $\alpha_{B|S=yes}$
Before    10,000       4970.88             5029.12              4970.88             5029.12
After     9,001        4472.71             5029.12              4473.73             5029.12



4 Learning Wizard 

The learning functionality of the Hugin Tool is supported through a Learning 
Wizard. A full learning cycle, as performed by the Learning Wizard, consists 
of three main steps: data acquisition, structural learning, and parameter 
estimation. Each of these consists of a number of sub-steps, which guide the user 
in the process of learning the Bayesian network from data and possibly expert 
knowledge. The user has the option of performing only one of the two learning 
steps, but in both cases the data acquisition step is required. 

4.1 Data Acquisition 

The data acquisition step serves two purposes: reading data from a data source 
and preprocessing the data. In the first sub-step, the user can read in data from 
various data sources, including databases and data files. In the second sub-step, 
the user can preprocess the data, e.g. discretize a variable. It is also possible for 
the user to supply a custom preprocessor if the existing preprocessor does not suffice. 



4.2 Structural Learning and Parameter Estimation 

The structural learning step contains two sub-steps: structural learning and 
data analysis. In the structural learning phase, the user can choose between two 
algorithms for performing the learning (PC and NPC). Common to these algorithms 
is that the user can control the result to some extent by specifying a 
significance-level parameter and by adding structural constraints on the DAG 
before the learning takes place. These structural constraints 
provide a way for the user to force known dependences/independences onto the 
learning algorithm. As it can be a tiresome task to specify these constraints 
for complex networks, the wizard facilitates the saving and loading of network 
information, including constraints, node positions, node labels, etc. 

If the user chooses NPC as the learning algorithm, it is also possible to 
resolve ambiguous regions or unresolved directions found during 
the learning process, see e.g. Fig. 1. In the data analysis phase, the strength of 
both the marginal dependences and the found data dependences can be examined, 
and the complexity of the learned network is indicated. 

The parameter estimation phase gives the user the possibility of specifying 
the initial value of the parameters and the parameters for the EM algorithm. 
The initial distribution is determined by any prior probabilities and experience 
counts specified by the user. To examine whether the algorithm may have found a 
local maximum, it is possible to randomize the prior probabilities, so that the 
initial distribution can be different for subsequent runs. 

5 Performance Evaluation 

In the performance evaluation we have used the ALARM network [2], which has 
become a standard benchmark for structural learning. The ALARM network 
consists of 37 variables and 46 edges. Each variable has between two and four 
states, with an average of 1.2 parents per variable. 

The PDAG shown in Fig. 3, which is the result of NPC learning on a 
sample of 10,000 cases generated from the ALARM network [2] with a 
significance level $\alpha = 0.01$, contains three ambiguous regions. When the three 
ambiguous regions are resolved by selecting the correct edges ((ArtCO2, Catechol), 
(VentLung, KinkedTube), and (LVFailure, LVEDVolume)), the adjacent edges are 
directed correctly. A few edges cannot be directed based on the data alone, a 
wrong collider is present at Intubation, no other edge is directed incorrectly, no 
extra edges are present, and one edge (MinVol, Intubation) is missing. Applying 
the PC algorithm to the same set of data produced a DAG with two 
incorrect colliders, the two edges (TPR, Anaphylaxis) and (LVFailure, History) 
missing, and a few incorrect directions on edges. At the moment, the PC 
algorithm generates a DAG structure where some edges have been given direction 
at random. Using the NPC algorithm with $\alpha = 0.05$ produces no missing edges, 
but an additional collider at Intubation. 

Fig. 3. NPC learning on a sample from ALARM with $\alpha = 0.01$ 

To evaluate the performance of the PC and NPC algorithms as a function 
of the significance level $\alpha$, we have performed tests with values of $\alpha$ equal 
to 0.001, 0.01, 0.05, and 0.1. The results are shown in Table 2. The tests have 
been performed using samples generated from the ALARM network. 

For both algorithms, the table shows the number of edges found including 
neighbors, with the number of incorrect edges found in parentheses, the number 
of edges with correct orientation, and the time to perform the learning in 
milliseconds. Furthermore, for the NPC algorithm the table shows the number of 
ambiguous regions and the number of uncertain edges in each region, with the 
number of missing edges that are represented as an uncertain edge in parentheses. 
The values are average values over 25 samples of 10,000 cases with 5% missing 
values (MCAR). 
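
One simple way to obtain data with values missing completely at random is to blank out each value independently with a fixed probability, as in the sketch below. How the test samples in this evaluation were actually produced is not described here, so the procedure, the function name `make_mcar` and the use of `None` as the missing marker are assumptions made for illustration.

```python
import random


def make_mcar(cases, missing_rate=0.05, seed=0):
    """Return a copy of the cases with values removed completely at random.

    Each value is replaced by None with probability missing_rate,
    independently of the value itself and of all other values.
    """
    rng = random.Random(seed)
    return [
        [None if rng.random() < missing_rate else value for value in case]
        for case in cases
    ]


# Hypothetical data set: 10,000 cases over five binary variables
source = random.Random(42)
cases = [[source.randint(0, 1) for _ in range(5)] for _ in range(10000)]
masked = make_mcar(cases)
missing = sum(value is None for case in masked for value in case)
print(missing / (len(masked) * 5))
```

Because the mask ignores the data values, the observed part of the sample remains an unbiased picture of the generating distribution, which is the property that MCAR refers to.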



Table 2. Results from using different values of $\alpha$ 

Algorithm   $\alpha$   Edges          Direction   Time (ms)   Regions   Uncertain edges
PC          0.001      45.25(0.5)     44.75       419
PC          0.01       45.5(0.25)     42.5        426
PC          0.05       44.25(0)       41.75       415
PC          0.1        45.25(0.25)    44          434
NPC         0.001      43.75(0)       39          4,015       1.5       5(1.25)
NPC         0.01       44(0.25)       34.5        4,152       2         7(2)
NPC         0.05       43(0)          36.25       3,805       2.25      9.25(2.75)
NPC         0.1        44.25(0)       38.75       4,125       1         4(1)

Fig. 4. Results of structural learning as a function of N 
Table 3. Average run-time in seconds as a function of number of nodes 

Nodes   25     50     75    100   150   200
PC      0.2    0.8    2     5     16    29
NPC     2      11     30    71    204   330



The results show that the average run-time of the PC algorithm is lower 
than that of the NPC algorithm. The PC algorithm is faster since fewer tests are 
performed and since there is no notion of ambiguous regions requiring additional 
computations. The PC algorithm is able to direct more edges than the NPC 
algorithm, but some of these are directed at random in order to obtain a DAG. 
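
The size of the gap can be read off Table 3 by dividing the NPC run-times by the PC run-times. The short computation below uses the figures from the table; the variable and function names are chosen for this illustration only.

```python
# Average run-times in seconds, copied from Table 3
nodes = [25, 50, 75, 100, 150, 200]
pc_seconds = [0.2, 0.8, 2, 5, 16, 29]
npc_seconds = [2, 11, 30, 71, 204, 330]


def slowdown(pc, npc):
    """Ratio of NPC to PC run-time for each network size."""
    return [n / p for p, n in zip(pc, npc)]


ratios = slowdown(pc_seconds, npc_seconds)
for size, ratio in zip(nodes, ratios):
    print(f"{size:>3} nodes: NPC/PC = {ratio:.1f}")
```

On these networks NPC takes roughly 10 to 15 times as long as PC, and the ratio does not grow steadily with the number of nodes.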

Some of the differences between the NPC and PC algorithms, which seem to 
be shortcomings of the NPC algorithm, can be remedied by improving the 
implementation. For instance, the principle of Occam’s Razor has not been applied to 
the ambiguous regions to reduce the number of uncertain edges in each region. 
On the ALARM network, this led to ambiguous regions containing a single edge, 
which is present in the ALARM network. 

The performance of the PC and NPC algorithms on large networks has 
been evaluated using randomly generated networks. For a fixed size in terms of the 
number of variables, 10 networks with random topology (zero to five parents) and 
distribution have been generated. Each variable has from two to five states. The 
results are shown in Table 3. The time performance tests have been performed 
using 10,000 cases with 5% missing cases (MCAR) drawn at random from the 
distribution of the network. 

All tests have been performed on an HP Omnibook xe4500 with a 1700 MHz 
Pentium 4 processor and 256 MB of RAM running Red Hat Linux 8. 

Demo 

A free demo version of the Hugin Tool can be downloaded from our web site: 
http://www.hugin.com. Questions related to the functionality of the Hugin Tool 
can be directed to support@hugin.com. 